LEARNING ONE-HIDDEN-LAYER NEURAL NETWORKS ON GAUSSIAN MIXTURE MODELS WITH GUARANTEED GENERALIZABILITY

Abstract

We analyze the learning problem of fully connected neural networks with the sigmoid activation function for binary classification in the teacher-student setup, where the outputs are assumed to be generated by a ground-truth teacher neural network with unknown parameters, and the learning objective is to estimate the teacher network by minimizing a non-convex cross-entropy risk function of the training data over a student neural network. This paper analyzes a general and practical scenario in which the input features follow a Gaussian mixture model of a finite number of Gaussian distributions with various means and variances. We propose a gradient descent algorithm with a tensor initialization approach and show that our algorithm converges linearly to a critical point that has a diminishing distance to the ground-truth model, with guaranteed generalizability. We characterize the required number of samples for successful convergence, referred to as the sample complexity, as a function of the parameters of the Gaussian mixture model. We prove analytically that when any mean or variance in the mixture model is large, or when all variances are close to zero, the sample complexity increases and the convergence slows down, indicating a more challenging learning problem. Although focusing on one-hidden-layer neural networks, this paper provides, to the best of our knowledge, the first explicit characterization of the impact of the parameters of the input distributions on the sample complexity and learning rate.

1. INTRODUCTION

Deep neural networks (LeCun et al., 2015) have demonstrated superior empirical performance in various applications such as speech recognition (Graves et al., 2013) and computer vision (Krizhevsky et al., 2012; He et al., 2016). Despite this numerical success, the theoretical underpinning of learning neural networks is much less investigated. One bottleneck for the wide acceptance of deep learning in critical applications is the lack of theoretical generalization guarantees, i.e., why a model learned from the training data would achieve high accuracy on the test data. This paper studies the generalization performance of neural networks in the "teacher-student" setup, where the training data are generated by a teacher neural network, and learning is performed on a student network by minimizing the empirical risk of the training data. The teacher-student setup has been studied in the statistical learning community for a long time (Engel & Broeck, 2001; Seung et al., 1992) and applied to neural networks recently (Goldt et al., 2019a; Zhong et al., 2017b;a; Zhang et al., 2019; 2020b; Fu et al., 2020; Zhang et al., 2020a). Assuming that the student network has the same architecture as the teacher network, the existing generalization analyses mostly focus on one-hidden-layer networks, because the optimization problem is already nonconvex, and the analytical complexity increases tremendously as the number of hidden layers increases. One critical assumption of most works in this line is that the input features follow the standard Gaussian distribution. Although other distributions are considered in (Du et al., 2017; Ghorbani et al., 2020; Goldt et al., 2019b; Li & Liang, 2018; Mei et al., 2018b; Mignacco et al., 2020; Yoshida & Okada, 2019), the generalization performance beyond the standard Gaussian input is less investigated. On the other hand, the learning performance clearly depends on the input data distribution.
(LeCun et al., 1998) states that learning methods converge faster if the inputs are whitened to the standard Gaussian. Batch normalization (Ioffe & Szegedy, 2015) modifies the mean and variance in each layer and is a popular practical method for achieving fast and stable convergence. Various explanations, such as (Bjorck et al., 2018; Chai et al., 2020; Santurkar et al., 2018), have been proposed for the enormous success of Batch normalization, but little consensus exists on the exact mechanism. Contributions: This paper provides a theoretical analysis of learning one-hidden-layer neural networks when the input distribution follows a Gaussian mixture model containing an arbitrary number of Gaussian distributions with arbitrary means and variances. The Gaussian mixture model has been employed in many applications such as data clustering and unsupervised learning (Dasgupta, 1999; Figueiredo & Jain, 2002; Jain, 2010), and image classification and segmentation (Permuter et al., 2006). The parameters of the mixture model can be estimated from data by the EM algorithm (Redner & Walker, 1984) or the moment-based method (Hsu & Kakade, 2013) with theoretical performance guarantees; see, e.g., (Ho & Nguyen, 2016; Ho et al., 2020; Dwivedi et al., 2020a;b). For the binary classification problem with the cross-entropy loss function, this paper proposes a gradient descent algorithm with tensor initialization to estimate the weights of a one-hidden-layer fully connected neural network. Our algorithm converges linearly to a critical point, and the returned critical point converges to the ground-truth model at a rate of $O(\sqrt{d\log n/n})$, where $d$ is the dimension of the features and $n$ is the number of samples. We also characterize the required number of samples for accurate estimation, referred to as the sample complexity, as a function of $d$, the number of neurons $K$, and the input distribution.
Our explicit bounds imply that (1) when the absolute value of any mean in the Gaussian mixture model increases from zero, the sample complexity increases, and the algorithm converges more slowly, indicating that it becomes more challenging to learn a model with a small test error; (2) the same phenomenon happens when any variance in the mixture model increases toward infinity from a certain positive value, or when all the variances in the mixture model approach zero. Our results indicate that training converges faster and requires fewer samples if the input data have zero mean and a certain non-zero variance. This can be viewed as one theoretical explanation, for one-hidden-layer networks, of the success of Batch normalization. Moreover, to the best of our knowledge, this paper provides the first theoretical and explicit characterization of how the mean and variance of the input distribution affect the sample complexity and learning rate.

1.1. RELATED WORK

Learning over-parameterized neural networks. One line of theoretical research considers the over-parameterized setting, where the number of network parameters is greater than the number of training samples (Bousquet & Elisseeff, 2002; Hardt et al., 2016; Keskar et al., 2016; Livni et al., 2014; Neyshabur et al., 2017; Rumelhart et al., 1988; Soltanolkotabi et al., 2018; Allen-Zhu et al., 2019a). (Allen-Zhu et al., 2019b; Du et al., 2019; Zou & Gu, 2019) show that deep neural networks can fit all training samples in polynomial time. The optimization problem has no spurious local minima (Livni et al., 2014; Zhang et al., 2016; Soltanolkotabi et al., 2018), and the global minimum of the empirical risk function can be obtained by gradient descent (Li & Yuan, 2017; Du et al., 2018b; Zou et al., 2020). Although the returned model can achieve a zero training error, these works do not discuss whether it achieves a small test error. (Allen-Zhu et al., 2019a; Li & Liang, 2018) analyze the generalization error by characterizing the training error and test error separately; still, there is no guarantee that a learned model with a small training error would have a small test error. (Cao & Gu, 2019) bounds the generalization error of the model learned by stochastic gradient descent (SGD) in deep neural networks, based on the assumption that there exists a good model with a small test error near the initialization of the SGD algorithm, and no discussion is provided about how to find such an initialization. In contrast, our tensor initialization method provides an initialization that is close to the ground-truth teacher model, so that our algorithm can find this model with guaranteed generalizability. Generalization performance with the standard Gaussian input.
In the teacher-student setup of one-hidden-layer neural networks, (Brutzkus & Globerson, 2017; Du et al., 2018a; Ge et al., 2018; Liang et al., 2018; Li & Yuan, 2017; Shamir, 2018; Safran & Shamir, 2018; Tian, 2017) consider the ideal case of an infinite number of training samples, so that the training and test accuracy coincide and can be analyzed simultaneously. When the number of training samples is finite, (Zhong et al., 2017b;a) characterize the sample complexity, i.e., the required number of samples, of learning one-hidden-layer fully connected neural networks with smooth activation functions and propose a gradient descent algorithm that converges to the ground-truth model linearly. (Zhang et al., 2019; 2020b) extend the analyses to the non-smooth ReLU for fully connected and convolutional neural networks, respectively. (Zhang et al., 2020a) analyzes the generalizability of graph neural networks for both regression and binary classification problems. (Fu et al., 2020) analyzes the cross-entropy loss function for binary classification problems. Compared with other common loss functions such as the squared loss, the cross-entropy loss function is harder to analyze due to its complicated form and the saturation phenomenon of its gradient and Hessian (Fu et al., 2020). Theoretical characterization of learning performance with other input distributions. (Du et al., 2017) considers rotationally invariant distributions, but the results only apply to a perceptron (i.e., a single-node network). (Mei et al., 2018b) analyzes the generalization error of one-hidden-layer neural networks in the mean-field limit trained on a large class of distributions, including mixtures of Gaussian distributions with the same mean. The results only hold in the high-dimensional regime where both the number of neurons $K$ and the input dimension $d$ are sufficiently large, and no sample complexity analysis is provided.
(Li & Liang, 2018) studies the generalization error of over-parameterized one-hidden-layer networks when the data come from mixtures of well-separated distributions, but the separation requirement excludes Gaussian distributions and Gaussian mixture models. (Yoshida & Okada, 2019) analyzes the Plateau Phenomenon, in which the decrease of the risk slows down significantly partway and speeds up again, in one-hidden-layer neural networks with inputs drawn from a single Gaussian with an arbitrary covariance. (Goldt et al., 2019b; 2020) analyze the dynamics of learning one-hidden-layer networks with SGD when the inputs are drawn from a wide class of generative models. (Mignacco et al., 2020) provides analytical equations for the SGD evolution of a perceptron trained on the Gaussian mixture model. (Ghorbani et al., 2020) considers inputs with low-dimensional structures and compares neural networks with kernel methods. Notations: Vectors are in bold lowercase, matrices and tensors are in bold uppercase, and scalars are in normal font. For instance, $Z$ is a matrix, and $z$ is a vector. $z_i$ denotes the $i$-th entry of $z$, and $Z_{i,j}$ denotes the $(i,j)$-th entry of $Z$. $[K]$ ($K > 0$) denotes the set of integers from 1 to $K$. $I_d \in \mathbb{R}^{d\times d}$ and $e_i$ represent the identity matrix in $\mathbb{R}^{d\times d}$ and the $i$-th standard basis vector, respectively. We use $\delta_i(Z)$ to denote the $i$-th largest singular value of $Z$. $A \succeq 0$ means $A$ is a positive semi-definite (PSD) matrix. The gradient and the Hessian of a function $f(W)$ are denoted by $\nabla f(W)$ and $\nabla^2 f(W)$, respectively. The outer product of vectors $z_i \in \mathbb{R}^{n_i}$, $i \in [l]$, is defined as $T = z_1 \otimes \cdots \otimes z_l \in \mathbb{R}^{n_1\times\cdots\times n_l}$ with $T_{j_1\cdots j_l} = (z_1)_{j_1}\cdots(z_l)_{j_l}$. Given a tensor $T \in \mathbb{R}^{n_1\times n_2\times n_3}$ and matrices $A \in \mathbb{R}^{n_1\times d_1}$, $B \in \mathbb{R}^{n_2\times d_2}$, $C \in \mathbb{R}^{n_3\times d_3}$, the $(i_1, i_2, i_3)$-th entry of the tensor $T(A, B, C)$ is given by
$$\sum_{i_1'=1}^{n_1}\sum_{i_2'=1}^{n_2}\sum_{i_3'=1}^{n_3} T_{i_1',i_2',i_3'}\, A_{i_1',i_1} B_{i_2',i_2} C_{i_3',i_3}. \tag{1}$$
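The multilinear form $T(A, B, C)$ defined above can be computed with a single tensor contraction; a minimal numpy sketch (the function name `multilinear` is ours):

```python
import numpy as np

def multilinear(T, A, B, C):
    # Multilinear form T(A, B, C): contract each mode of the third-order
    # tensor T with a matrix, per the entry-wise definition in (1).
    # T: (n1, n2, n3); A: (n1, d1); B: (n2, d2); C: (n3, d3) -> (d1, d2, d3)
    return np.einsum('ijk,ia,jb,kc->abc', T, A, B, C)

rng = np.random.default_rng(0)
T = rng.standard_normal((4, 4, 4))
A = rng.standard_normal((4, 2))
out = multilinear(T, A, A, A)
print(out.shape)  # (2, 2, 2)
```

For a rank-one tensor $T = u \otimes u \otimes u$, this reduces to $(A^\top u)^{\otimes 3}$, which is the property the initialization method exploits.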
We follow the convention that $f(x) = O(g(x))$ (resp. $\Omega(g(x))$, $\Theta(g(x))$) means that $f(x)$ increases at most, at least, or in the order of $g(x)$, respectively.

2. PROBLEM FORMULATION

We consider a one-hidden-layer fully connected neural network where all the weights in the second layer have the same fixed value. This structure is also known as the committee machine, see, e.g., (Aubin et al., 2018; Monasson & Zecchina, 1995; Schwarze & Hertz, 1992; 1993). Let $x \in \mathbb{R}^d$ denote the input features, and let $K \ge 1$ be the number of neurons in the hidden layer. Following the teacher-student setup, see, e.g., (Fu et al., 2020), the output labels are generated by a teacher neural network with unknown ground-truth weights $w_j^* \in \mathbb{R}^d$ ($j \in [K]$). Let $W^* = [w_1^*, \ldots, w_K^*] \in \mathbb{R}^{d\times K}$ contain all the weights, and let $\delta_i(W^*)$ denote the $i$-th largest singular value of $W^*$. Let $\kappa = \frac{\delta_1(W^*)}{\delta_K(W^*)}$, and define $\eta = \prod_{i=1}^{K}\frac{\delta_i(W^*)}{\delta_K(W^*)}$. The nonlinear activation function here is the sigmoid function $\phi(x) = \frac{1}{1+\exp(-x)}$. We consider binary classification, and the binary output $y$ is generated by the teacher committee machine through
$$\mathbb{P}(y = 1 \mid x) = H(W^*, x) := \frac{1}{K}\sum_{j=1}^{K}\phi(w_j^{*\top} x). \tag{2}$$
Learning is performed over a student neural network that has the same architecture as the teacher network, with weights denoted by $W \in \mathbb{R}^{d\times K}$. Given $n$ pairs of training samples $\{x_i, y_i\}_{i=1}^{n}$, the empirical risk function is
$$f_n(W) = \frac{1}{n}\sum_{i=1}^{n}\ell(W; x_i, y_i), \tag{3}$$
where $\ell(W; x_i, y_i)$ is the cross-entropy loss function, i.e.,
$$\ell(W; x_i, y_i) = -y_i\log(H(W, x_i)) - (1 - y_i)\log(1 - H(W, x_i)). \tag{4}$$
To estimate $W^*$ from the training samples, we solve the following nonconvex minimization problem:
$$\min_{W \in \mathbb{R}^{d\times K}} f_n(W). \tag{5}$$
Here we assume the input features $x_i$ are generated i.i.d. from the Gaussian mixture model (Pearson, 1894; Titterington et al., 1985; Hsu & Kakade, 2013), which we denote as
$$x \sim \sum_{l=1}^{L}\lambda_l\,\mathcal{N}(\mu_l, \sigma_l^2 I_d), \tag{6}$$
where $\mathcal{N}(\mu_l, \sigma_l^2 I_d)$ denotes the multivariate Gaussian distribution with mean $\mu_l \in \mathbb{R}^d$ and covariance $\sigma_l^2 I_d$, $\sigma_l \in \mathbb{R}^+$, for all $l \in [L]$.
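The teacher output $H(W^*, x)$ in (2) and the empirical risk (3)-(5) can be written as a short numpy sketch (all names are ours; the clipping is added only for numerical safety and plays no role in the analysis):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def H(W, X):
    # H(W, x) = (1/K) * sum_j sigmoid(w_j^T x) as in (2), for every row of X.
    # W: (d, K) weights; X: (n, d) features.
    return sigmoid(X @ W).mean(axis=1)

def empirical_risk(W, X, y, eps=1e-12):
    # Cross-entropy loss (4), averaged over the n samples as in (3).
    p = np.clip(H(W, X), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

rng = np.random.default_rng(0)
d, K, n = 5, 3, 200
W_star = rng.standard_normal((d, K))
X = rng.standard_normal((n, d))
y = (rng.random(n) < H(W_star, X)).astype(float)  # y = 1 with probability H(W*, x)
print(empirical_risk(W_star, X, y))
```

Note that the labels are stochastic given $x$, so even the ground-truth weights incur a nonzero cross-entropy risk.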
The Gaussian mixture model can be viewed as
$$x := \mu_h + z_h \in \mathbb{R}^d, \tag{7}$$
where $h$ is a discrete random variable with $\Pr(h = l) = \lambda_l$ for $l \in [L]$, and $z_h$ follows the multivariate Gaussian $\mathcal{N}(0, \sigma_h^2 I_d)$ with zero mean and covariance $\sigma_h^2 I_d$.¹ If the Gaussian mixture model is symmetric, the distribution can be written as
$$x \sim \begin{cases} \displaystyle\sum_{l=1}^{L/2}\lambda_l\big(\mathcal{N}(\mu_l, \sigma_l^2 I_d) + \mathcal{N}(-\mu_l, \sigma_l^2 I_d)\big), & L \text{ is even}, \\[2mm] \displaystyle\lambda_1\mathcal{N}(0, \sigma_1^2 I_d) + \sum_{l=2}^{(L+1)/2}\lambda_l\big(\mathcal{N}(\mu_l, \sigma_l^2 I_d) + \mathcal{N}(-\mu_l, \sigma_l^2 I_d)\big), & L \text{ is odd}. \end{cases} \tag{8}$$
We assume without loss of generality that $\mu_l$ belongs to the column space of $W^*$ for all $l \in [L]$. To see this, note that an arbitrary $\mu_l$ can be written as $\mu_{l\parallel} + \mu_{l\perp}$, where $\mu_{l\parallel}$ belongs to the column space of $W^*$, and $\mu_{l\perp}$ is perpendicular to it. Then, from (2) and (7) we have
$$H(W^*, x) = \frac{1}{K}\sum_{j=1}^{K}\phi\big(w_j^{*\top}(\mu_{h\parallel} + \mu_{h\perp} + z_h)\big) = \frac{1}{K}\sum_{j=1}^{K}\phi\big(w_j^{*\top}(\mu_{h\parallel} + z_h)\big) = H(W^*, x'), \tag{9}$$
where $x' \sim \sum_{l=1}^{L}\lambda_l\,\mathcal{N}(\mu_{l\parallel}, \sigma_l^2 I_d)$. Thus, these two cases are equivalent.
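Sampling from the mixture via the latent-component view (7) can be sketched as follows (a minimal illustration; the parameter values are arbitrary and chosen to make the mixture symmetric as in (8)):

```python
import numpy as np

def sample_gmm(n, lambdas, mus, sigmas, rng):
    # x = mu_h + z_h as in (7): draw the component index h with
    # probabilities lambda_l, then add N(0, sigma_h^2 I_d) noise.
    d = mus.shape[1]
    h = rng.choice(len(lambdas), size=n, p=lambdas)
    z = rng.standard_normal((n, d)) * sigmas[h][:, None]
    return mus[h] + z, h

rng = np.random.default_rng(0)
lambdas = np.array([0.5, 0.5])
mus = np.array([[1.0] * 5, [-1.0] * 5])   # mu_2 = -mu_1: symmetric case of (8)
sigmas = np.array([1.0, 1.0])
X, h = sample_gmm(4000, lambdas, mus, sigmas, rng)
print(X.mean(axis=0))  # close to the zero mixture mean
```

The empirical mean is near zero because the two components cancel, while the per-coordinate variance is $\sigma^2 + \|\mu\|$-dependent, which is the regime the later experiments vary.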

3. PROPOSED LEARNING ALGORITHM

We propose Algorithm 1 to solve (5) and defer its theoretical analysis to Section 4. The method starts from an initialization $W_0 \in \mathbb{R}^{d\times K}$ computed by the tensor initialization method (Subroutine 1) and then updates the iterates $W_t$ using gradient descent with step size $\eta_0$. To analyze the general case, we assume i.i.d. zero-mean noise $\{\nu_i\}_{i=1}^{n} \in \mathbb{R}^{d\times K}$ with bounded magnitude $|(\nu_i)_{jk}| \le \xi$ ($j \in [d]$, $k \in [K]$) for some $\xi \ge 0$ when computing the gradient of the loss in (4). Our tensor initialization method is extended from (Janzamin et al., 2014) and (Zhong et al., 2017b). The idea is to compute quantities ($M_j$ in (10)) that are tensors of the $w_i^*$ and then apply a tensor decomposition method to estimate the $w_i^*$. Because the $M_j$ can only be estimated from training samples, tensor decomposition does not return the $w_i^*$ exactly but provides a close approximation. Because the existing method only applies to the standard Gaussian, we exploit the relationship between probability density functions and tensor expressions developed in (Janzamin et al., 2014) to design tensors suitable for the Gaussian mixture model. Formally,

¹ One can easily extend our analysis to the case when the covariance is $\mathrm{diag}(\sigma_{l1}^2, \cdots, \sigma_{ld}^2)$; one needs to revise Property 4 and Lemma 7 correspondingly. We use the same $\sigma_l$ to simplify the presentation.

Algorithm 1 Our proposed learning algorithm

Input: training data $\{(x_i, y_i)\}_{i=1}^{n}$, step size $\eta_0 = O\Big(\frac{1}{\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty + \sigma_l)^2}\Big)$, number of iterations $T$
Initialization: $W_0 \leftarrow$ tensor initialization method via Subroutine 1
Gradient descent: for $t = 0, 1, \cdots, T-1$:
$$W_{t+1} = W_t - \eta_0\cdot\frac{1}{n}\sum_{i=1}^{n}\big(\nabla\ell(W_t; x_i, y_i) + \nu_i\big) = W_t - \eta_0\Big(\nabla f_n(W_t) + \frac{1}{n}\sum_{i=1}^{n}\nu_i\Big)$$
Output: $W_T$

Definition 1 Let $p(x) = \sum_{l=1}^{L}\lambda_l(2\pi\sigma_l^2)^{-\frac{d}{2}}\exp\big(-\frac{\|x - \mu_l\|^2}{2\sigma_l^2}\big)$ be the probability density function of the Gaussian mixture model in (6). We define
$$M_j := \mathbb{E}_{x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2 I)}\big[y\cdot(-1)^j p^{-1}(x)\nabla^{(j)}p(x)\big], \quad j = 1, 2, 3. \tag{10}$$
Let $\alpha \in \mathbb{R}^d$ denote an arbitrary vector. If the Gaussian mixture model is symmetric as in (8), then $P_2 := M_3(I_d, I_d, \alpha)$; otherwise, $P_2 := M_2$.
$M_j$ is a $j$th-order tensor of the $w_i^*$; e.g., $M_3 = \frac{1}{K}\sum_{i=1}^{K}\mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2 I)}[\phi'''(w_i^{*\top}x)]\,w_i^{*\otimes3}$. These quantities cannot be computed directly from (10), but they can be estimated by sample means, denoted by $\hat M_i$ ($i = 1, 2, 3$) and $\hat P_2$, from the samples $\{x_i, y_i\}_{i=1}^{n}$. The following assumption guarantees that these tensors are nonzero and can thus be leveraged to estimate $W^*$.
Assumption 1 The Gaussian mixture model in (6) satisfies the following conditions:
1. $\mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2 I)}[\phi'''(w_i^{*\top}x)] \neq 0$ for $i \in [K]$, which implies that $M_3$ is nonzero.
2. If the distribution is not symmetric, then $\mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2 I)}[\phi''(w_i^{*\top}x)] \neq 0$ for $i \in [K]$, which implies that $M_2$, and hence $P_2$, is nonzero in this case.
Note that Assumption 1 is a very mild assumption. Moreover, as indicated in (Janzamin et al., 2014), in the rare case that some of the quantities $M_i$ ($i = 1, 2, 3$) and $P_2$ are zero, one can construct higher-order tensors in a similar way as in Definition 1 and then estimate $W^*$ from those higher-order tensors. Subroutine 1 estimates the direction and magnitude of $w_j^*$, $j \in [K]$, separately. The key steps are as follows.
We first use the power method to decompose $\hat P_2$ to approximate the subspace spanned by $\{w_1^*, w_2^*, \cdots, w_K^*\}$, denoted by $\hat U$. Then, we project $\hat M_3 \in \mathbb{R}^{d\times d\times d}$ to $\hat R_3 \in \mathbb{R}^{K\times K\times K}$ using $\hat U$ to reduce the computational and sample complexity of decomposing a third-order tensor in the next step. We then apply the KCL algorithm to decompose $\hat R_3$ into vectors $\hat v_i$. Note that $\hat U\hat v_i \approx s_i\bar w_i^*$, where $\bar w_i^*$ is the unit vector along $w_i^*$ and $s_i \in \{1, -1\}$ is a random sign; this determines the direction of each $w_i^*$ up to sign. Finally, the magnitudes of the $w_i^*$ and the signs $s_i$ are determined by solving a linear system of equations via the RecMagSign method. Please refer to (Zhong et al., 2017b) and (Kuleshov et al., 2015) for more details on the power method, KCL, and RecMagSign.
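The PowerMethod step, which extracts an orthonormal basis $\hat U$ for the top-$K$ eigenspace of the symmetric matrix $\hat P_2$, can be sketched with orthogonal (subspace) iteration; this is a standard stand-in, not necessarily the authors' exact implementation:

```python
import numpy as np

def power_method(P, K, iters=100, seed=0):
    # Orthogonal (subspace) iteration: repeatedly multiply a random
    # d x K block by P and re-orthonormalize via QR. For symmetric P,
    # this converges to an orthonormal basis of the top-K eigenspace.
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((P.shape[0], K)))
    for _ in range(iters):
        U, _ = np.linalg.qr(P @ U)
    return U

# Toy check: a matrix built from a rank-2 factor should be recovered.
rng = np.random.default_rng(1)
W = rng.standard_normal((6, 2))
P = W @ W.T
U = power_method(P, 2)
print(np.linalg.norm(P - U @ (U.T @ P @ U) @ U.T))  # near 0
```

In the actual subroutine, $P$ is the sample estimate $\hat P_2$, so the recovered subspace only approximates the span of the $w_i^*$ up to the estimation error.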

4. MAIN THEORETICAL RESULTS

The main idea of our analysis is to show that the empirical risk function in (3) is strongly convex in a region near $W^*$. Then $W_0$ returned by Subroutine 1 lies in this convex region, and the iterates returned by Algorithm 1 converge to a critical point in this region. Before formally stating our result in Theorem 1, we summarize its key implications as follows.
1. Convergence rate and estimation accuracy: When the gradients are accurate (i.e., $\xi = 0$), the iterates $W_t$ converge linearly to a critical point $\hat W_n$, and the distance between $\hat W_n$ and $W^*$ is $O(\sqrt{d\log n/n})$. With noise in the gradient, there is an additional error term of $O(\xi\sqrt{d\log n/n})$. For example, when $n$ is $\Theta(d\log^2 d)$, the estimation error decays as $O\big(\frac{1+\xi}{\sqrt{\log d}}\big)$.

Subroutine 1 Tensor Initialization Method
Input: partition the $n$ pairs of data $\{(x_i, y_i)\}_{i=1}^{n}$ into three subsets $D_1$, $D_2$, $D_3$
Compute $\hat P_2$ using $D_1$ and an arbitrary vector $\alpha$
$\hat U \leftarrow$ PowerMethod($\hat P_2$, $K$)
Compute $\hat R_3 = \hat M_3(\hat U, \hat U, \hat U)$ from data set $D_2$
$\{\hat v_i\}_{i\in[K]} \leftarrow$ KCL($\hat R_3$)
$W_0 \leftarrow$ RecMagSign($\hat U$, $\{\hat v_i\}_{i\in[K]}$, $D_3$)
Return: $W_0$

2. Sample complexity:

The sample complexity for accurate estimation is $\Theta(d\log^2 d)$, where $d$ is the feature dimension. This is of the same order as the sample complexity for the standard Gaussian input in (Fu et al., 2020) and (Zhong et al., 2017b), indicating that our method can handle input from the Gaussian mixture model without increasing the order of the sample complexity. Our bound is almost order-wise optimal with respect to $d$ because the degree of freedom is $dK$. The additional multiplier of $\log^2 d$ results from the concentration bounds in the proof technique.
3. Impact of the mean: If everything else is fixed, and the absolute value of at least one entry $\mu_{l(i)}$ (the $i$-th entry of $\mu_l$) of a mean of the Gaussian mixture model increases from 0, the sample complexity increases to infinity, and the convergence slows down. The intuition is that as the absolute value of some mean increases, some training samples have such a large magnitude that the sigmoid function saturates. These samples are not informative for estimating $W^*$, and their gradients are close to zero. Therefore, the number of samples required to estimate $W^*$ increases, and the gradient descent algorithm slows down.
4. Impact of the variance: If everything else is fixed, and at least one variance $\sigma_l^2$ of the Gaussian mixture model increases from a certain positive value, the sample complexity increases to infinity, and the convergence slows down. The intuition is the same as for increasing $|\mu_{l(i)}|$ in point 3. On the other hand, when all variances in the Gaussian mixture model approach zero, the sample complexity also increases to infinity, and the convergence slows down. The intuition is that when the input data are concentrated on a few vectors, the optimization problem does not have a benign landscape.
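The saturation intuition in points 3 and 4 is easy to check numerically: the sigmoid derivative $\phi'(t) = \phi(t)(1 - \phi(t))$ vanishes for inputs of large magnitude, so a sample whose pre-activation $w^{*\top}x$ is pushed far from zero by a large mean or variance contributes almost nothing to the gradient. A minimal check:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dsigmoid(t):
    # phi'(t) = phi(t) * (1 - phi(t)); maximized at t = 0 with value 1/4.
    s = sigmoid(t)
    return s * (1.0 - s)

for t in [0.0, 2.0, 5.0, 10.0]:
    print(t, dsigmoid(t))
# phi'(0) = 0.25, while phi'(10) is about 4.5e-5: a pre-activation of 10
# makes the sample nearly uninformative for gradient-based estimation.
```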
Combining points 3 and 4, one can see that to learn the teacher network characterized by (2), the training samples should have zero mean and a medium level of variance to reduce the sample complexity and speed up the convergence. If the variance is too large, some samples become non-informative and affect the learning negatively. If the variance is too small, the learning problem becomes mathematically challenging to solve. This theoretical characterization can be viewed as one motivation for empirical techniques that improve the learning rate, such as whitening (LeCun et al., 1998) and Batch normalization (Ioffe & Szegedy, 2015). We state our main theoretical result as follows.
Theorem 1 Consider the binary classification problem with the one-hidden-layer fully connected neural network in (2), and suppose Assumption 1 holds. Then there exist $\epsilon_0 \in (0, \frac{1}{4})$ and positive value functions $B(\lambda, M, \sigma, W^*)$ and $q(\lambda, M, \sigma, W^*)$ such that, as long as the sample size $n$ satisfies
$$n \ge n_{sc} := \mathrm{poly}(\epsilon_0^{-1}, \kappa, K)\,B(\lambda, M, \sigma, W^*)\,d\log^2 d, \tag{11}$$
we have that, with probability at least $1 - d^{-10}$, the iterates $\{W_t\}_{t=1}^{T}$ returned by Algorithm 1 with step size $\eta_0 = O\big(\frac{1}{\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty + \sigma_l)^2}\big)$ converge linearly, up to a statistical error, to a critical point $\hat W_n$ with rate of convergence $v = 1 - K^{-2}q(\lambda, M, \sigma, W^*)$, i.e.,
$$\|W_t - \hat W_n\|_F \le v^t\|W_0 - \hat W_n\|_F + \frac{\eta_0\,\xi}{1 - v}\sqrt{dK\log n/n}. \tag{12}$$
Moreover, the distance between $W^*$ and $\hat W_n$ is bounded by $\|\hat W_n - W^*\|_F \le O\big(K\sqrt{d\log n/n}\big)$.
We next quantify the impact of the parameters of the Gaussian mixture model on the sample complexity $n_{sc}$ and the convergence rate $v$ discussed in Theorem 1 as follows.
Corollary 1 (Impact of the Gaussian mixture model on $n_{sc}$ and $v$) (1) When everything else is fixed, $n_{sc}$ increases to infinity, and $v$ increases to 1, as $|\mu_{l(i)}|$ increases for any $l \in [L]$ and $i \in [d]$, where $\mu_{l(i)}$ is the $i$-th entry of $\mu_l$. (2)
When everything else is fixed except some $\sigma_l$, $l \in [L]$, $n_{sc}$ increases to infinity, and $v$ increases to 1, as $\sigma_l$ increases from $\zeta_s$ for some constant $\zeta_s > 0$. (3) $n_{sc}$ increases to infinity, and $v$ increases to 1, if $\sigma_l \to 0$ for all $l \in [L]$.
To the best of our knowledge, Theorem 1 provides the first explicit characterization of the sample complexity and learning rate when the input follows the Gaussian mixture model. Although we consider the sigmoid activation in this paper, our results apply to any activation function $\phi$ provided that $\phi'$ is an even function and $\phi'$, $\phi''$, and $\phi'''$ are bounded; examples include tanh and erf. Algorithm 1 employs a constant step size. One can potentially speed up the convergence, i.e., reduce $v$, by using a variable step size; we leave the corresponding theoretical analysis for future work. If we scale the weights as $W^{*\prime} = W^*/c$ and the input features as $x' = cx$ simultaneously, the output remains the same for any nonzero constant $c$, so the learning problems in the two cases are equivalent in terms of sample complexity and convergence rate. Theorem 1 reflects this equivalence: one can check from the proof in Section B that $B(\lambda, M, \sigma, W^*) = B(\lambda, cM, c\sigma, W^*/c)$, and the convergence rate in (12) likewise remains the same in both cases. One main component of the proof of Theorem 1 is to show that if (11) holds, the landscape of the empirical risk is close to that of the population risk in a local neighborhood of $W^*$. (Mei et al., 2018a) quantified the similarity of these two functions when $K = 1$, but it is not clear whether their approach extends to the case $K > 1$. Here, focusing on the Gaussian mixture model, we explicitly quantify the impact of the parameters of the input distribution on the landscapes of these functions; please see Appendix C for details. Compared with the analyses for the standard Gaussian in (Fu et al., 2020; Zhong et al., 2017b), we develop new techniques in the following aspects.
First, a direct extension of the matrix concentration inequalities in these works leads to a sample complexity bound of $O(d^3)$, while we develop new concentration bounds that tighten it to $O(d\log^2 d)$. Second, the existing analysis for bounding the Hessian of the population risk function does not extend to the Gaussian mixture model; we develop new tools that also apply to other activation functions such as tanh and erf. Third, we design new tensors for the initialization, and the proof of the tensor initialization is revised accordingly. The above results assume that the parameters of the Gaussian mixture model are known. In practice, they can be estimated by the EM algorithm (Redner & Walker, 1984) or the moment-based method (Hsu & Kakade, 2013). The EM algorithm returns model parameters within Euclidean distance $O((d/n)^{1/2})$ of the ground truth when the number of mixture components $L$ is known. When $L$ is unknown, one usually overspecifies an estimate $L' > L$, and the estimation error of the EM algorithm then scales as $O((d/n)^{1/4})$. Please refer to (Ho & Nguyen, 2016; Ho et al., 2020; Dwivedi et al., 2020a;b) for details.
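As a concrete illustration of this estimation step, here is a minimal EM iteration for a one-dimensional two-component Gaussian mixture (a textbook sketch under arbitrary synthetic parameters, not the algorithms of the cited references):

```python
import numpy as np

def em_1d(x, iters=200, seed=0):
    # EM for lambda_1 N(mu_1, s_1^2) + lambda_2 N(mu_2, s_2^2) in 1-D.
    rng = np.random.default_rng(seed)
    lam = np.array([0.5, 0.5])
    mu = rng.standard_normal(2)          # random initial means
    var = np.array([1.0, 1.0])
    for _ in range(iters):
        # E-step: posterior responsibility of each component per sample.
        dens = lam * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted mixing weights, means, variances.
        nk = r.sum(axis=0)
        lam = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return lam, mu, np.sqrt(var)

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 2000), rng.normal(3, 1, 2000)])
lam, mu, sig = em_1d(x)
print(sorted(mu))  # close to [-3, 3]
```

With well-separated components, as here, EM recovers the means and weights accurately; the $O((d/n)^{1/2})$ and $O((d/n)^{1/4})$ rates quoted above describe the harder regimes analyzed in the references.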

5. NUMERICAL EXPERIMENTS

We verify Theorem 1 through numerical experiments. We generate a ground-truth $W^* \in \mathbb{R}^{d\times K}$ from the Gaussian distribution. The training samples $\{x_i, y_i\}_{i=1}^{n}$ are generated using (6) and (2). The maximum number of iterations of Algorithm 1 is set to 12000.
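The gradient step of Algorithm 1 used in these experiments can be sketched in Python as follows. This is a simplified illustration, not the authors' MATLAB code: it uses a random initialization near $W^*$ in place of Subroutine 1, noiseless gradients ($\nu_i = 0$), an arbitrary step size, and standard Gaussian inputs; the analytic gradient follows from differentiating (4):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def risk(W, X, y, eps=1e-12):
    # Empirical cross-entropy risk f_n(W), as in (3)-(4).
    p = np.clip(sigmoid(X @ W).mean(axis=1), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def grad(W, X, y, eps=1e-12):
    # Analytic gradient: d loss / d w_j = (H - y)/(H(1-H)) * (1/K) phi'(w_j^T x) x,
    # averaged over the n samples.
    S = sigmoid(X @ W)                          # (n, K) neuron outputs
    H = np.clip(S.mean(axis=1), eps, 1.0 - eps)
    resid = (H - y) / (H * (1.0 - H))           # (n,)
    return X.T @ (resid[:, None] * S * (1.0 - S)) / (len(y) * W.shape[1])

rng = np.random.default_rng(0)
d, K, n = 5, 2, 5000
W_star = rng.standard_normal((d, K))
X = rng.standard_normal((n, d))
y = (rng.random(n) < sigmoid(X @ W_star).mean(axis=1)).astype(float)

W = W_star + 0.1 * rng.standard_normal((d, K))  # random init near W*
for _ in range(1000):
    W = W - 0.2 * grad(W, X, y)  # a noisy version would add (1/n) * sum_i nu_i
print(np.linalg.norm(W - W_star))
```

The iterates settle at a critical point of the empirical risk whose distance to $W^*$ shrinks with $n$, consistent with the statistical-error term in Theorem 1.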

5.1. TENSOR INITIALIZATION

Fig. 1 shows the accuracy of the model returned by Algorithm 1. Here $d = 5$, $K = 2$, $\lambda_1 = \lambda_2 = 0.5$, $\mu_1 = -1$, and $\mu_2 = 0$. We compare the tensor initialization with a random initialization in a local region $\{W \in \mathbb{R}^{d\times K}: \|W - W^*\|_F/\|W^*\|_F \le \epsilon\}$. Tensor initialization in Subroutine 1 returns an initial point close to $W^*$, with a relative error of 0.61. If the random initialization is also close to $W^*$, e.g., $\epsilon = 0.1$, then the gradient descent algorithm converges to a critical point from both initializations, and the linear convergence rate is the same. If the random initialization is far away, e.g., $\epsilon = 1.5$, the algorithm does not converge. On a MacBook Pro with an Intel(R) Core(TM) i5-7360U CPU at 2.30GHz and MATLAB 2017a, it takes 0.55 seconds to compute the tensor initialization. We use a random initialization with $\epsilon = 0.1$ in the following experiments to simplify the computation.

5.2. SAMPLE COMPLEXITY

Consider the case that $K = 3$, $L = 2$, and $\lambda_1 = \lambda_2 = \frac{1}{2}$. Let $\mu_1$ be the all-one vector in $\mathbb{R}^d$, let $\mu_2 = -\mu_1$, and let $\sigma_1 = \sigma_2 = 1$.
We next study the impact of the variance on the convergence, with $\lambda_1 = \lambda_2 = 0.5$, $\mu_1 = 1$, $\mu_2 = -1$, and $\sigma_1 = \sigma_2 = \sigma$. The number of samples $n$ is set to 50000. Among the different values of $\sigma$ we test, Algorithm 1 converges fastest when $\sigma = 1$; the convergence slows down when $\sigma$ increases to 2 or decreases to 0.5. This is consistent with our theoretical results in Section 4.
We then verify the convergence rate in (12), which shows that $v = 1 - \Theta(K^{-2})$. We set $\lambda_1 = \lambda_2 = 0.5$, $\mu_1 = 1$, $\mu_2 = -1$, and $\sigma_1 = \sigma_2 = 1$.
We then evaluate the distance between the $\hat W_n$ returned by Algorithm 1 and $W^*$, measured by $\|\hat W_n - W^*\|_F$. Here $d = 5$, $n$ ranges from $2\times10^3$ to $6\times10^4$, $\sigma_1 = \sigma_2 = 3$, and $\mu_1 = 1$, $\mu_2 = -1$.

6. CONCLUSIONS

This paper analyzes the theoretical performance of learning one-hidden-layer neural networks for binary classification when the input follows a Gaussian mixture model. We develop an algorithm that converges linearly to a model with a diminishing difference from the ground-truth model and guaranteed generalizability. We also provide the first explicit characterization of the impact of the input distribution on the sample complexity and convergence rate. Future work includes the analysis of multi-hidden-layer neural networks and multi-class classification; because of the concatenation of nonlinear activation functions, the analysis of the landscape of the empirical risk and the design of a proper initialization are more challenging and require the development of new tools.

A PRELIMINARIES

In this section, we introduce definitions and properties that will be used in proving the main results. First, we define sub-Gaussian random variables and the sub-Gaussian norm.
Definition 2 We say $X$ is a sub-Gaussian random variable with sub-Gaussian norm $K > 0$ if $(\mathbb{E}|X|^p)^{\frac1p} \le K\sqrt{p}$ for all $p \ge 1$. In addition, the sub-Gaussian norm of $X$, denoted $\|X\|_{\psi_2}$, is defined as $\|X\|_{\psi_2} = \sup_{p\ge1} p^{-\frac12}(\mathbb{E}|X|^p)^{\frac1p}$.
We then define the following three quantities. $\rho(u, \sigma)$ is motivated by the $\rho$ parameter for the standard Gaussian distribution in (Zhong et al., 2017b), which we generalize to a Gaussian with arbitrary mean and variance. We define the new quantities $\Gamma(\lambda, M, \sigma, W^*)$ and $D_m(\lambda, M, \sigma)$ for the Gaussian mixture model.
Definition 3 ($\rho$-function). Let $z \sim \mathcal{N}(u, I_d) \in \mathbb{R}^d$. Define $\alpha_q(i, u, \sigma) = \mathbb{E}_{z_i\sim\mathcal{N}(u_i,1)}[\phi'(\sigma\cdot z_i)z_i^q]$ and $\beta_q(i, u, \sigma) = \mathbb{E}_{z_i\sim\mathcal{N}(u_i,1)}[\phi'^2(\sigma\cdot z_i)z_i^q]$ for all $q \in \{0, 1, 2\}$, where $z_i$ and $u_i$ are the $i$-th entries of $z$ and $u$, respectively. Define $\rho(u, \sigma)$ as
$$\rho(u, \sigma) = \min_{i,j\in[d],\,j\neq i}\Big\{(u_j^2 + 1)\big(\beta_0(i, u, \sigma) - \alpha_0(i, u, \sigma)^2\big),\; \beta_2(i, u, \sigma) - \frac{\alpha_2(i, u, \sigma)^2}{u_i^2 + 1}\Big\}. \tag{14}$$
Definition 4 ($\Gamma$-function). With (6), (14), and $\kappa$, $\eta$ defined in Section 2, we define
$$\Gamma(\lambda, M, \sigma, W^*) = \sum_{l=1}^{L}\frac{\lambda_l}{\kappa^2\eta}\cdot\frac{\sigma_l^2}{\sigma_{\max}^2}\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\,\delta_K(W^*)},\, \sigma_l\,\delta_K(W^*)\Big). \tag{15}$$
Definition 5 ($D$-function). Given the Gaussian mixture model in (6) and any positive integer $m$, define $D_m(\lambda, M, \sigma)$ as
$$D_m(\lambda, M, \sigma) = \sum_{l=1}^{L}\lambda_l\Big(\frac{\|\mu_l\|_\infty}{\sigma_l} + 1\Big)^m, \tag{16}$$
where $\lambda = (\lambda_1, \cdots, \lambda_L) \in \mathbb{R}^L$, $M = (\mu_1, \cdots, \mu_L) \in \mathbb{R}^{d\times L}$, and $\sigma = (\sigma_1, \cdots, \sigma_L) \in \mathbb{R}^L$.
The $\rho$-function is defined to lower bound the Hessian of the population risk with Gaussian input. The $\Gamma$-function is the weighted sum of the $\rho$-function under the Gaussian mixture distribution; it is positive and upper bounded by a small value, and it decreases when $|\mu_{l(i)}|$ increases. When $\sigma_l$ increases, $\Gamma$ first increases and then decreases.
$\Gamma$ goes to zero if all $\|\mu_l\|_\infty$ or all $\sigma_l$ go to infinity. The $D$-function is a normalized parameter for the means and variances. It is lower bounded by 1, and it is an increasing function of $\|\mu_l\|_\infty$ and a decreasing function of $\sigma_l$.

Property 1. $\|\nu_i\|_F$ is a sub-Gaussian random variable with its sub-Gaussian norm bounded by $\xi\sqrt{d}K$.

Proof:
$$(\mathbb{E}\|\nu_i\|_F^p)^{\frac{1}{p}}\le(\mathbb{E}|\sqrt{d}K\xi|^p)^{\frac{1}{p}}\le\xi\sqrt{d}K \qquad (17)$$

Property 2. $\rho(u,\sigma)$ in Definition 3 satisfies the following properties:

1. $\rho(u,\sigma)>0$ for any $u\in\mathbb{R}^d$ and $\sigma\neq 0$.

2. $\rho(u,\sigma)$ converges to a positive function of $\sigma$ as $u_i$ goes to 0, i.e., $\lim_{u_i\to 0}\rho(u,\sigma):=C_m(\sigma)$.

3. When all $u_i\neq 0$ ($i\in[d]$), $\rho(\frac{u}{\sigma},\sigma)$ converges to a positive function of $u$ as $\sigma$ goes to 0, i.e., $\lim_{\sigma\to 0}\rho(\frac{u}{\sigma},\sigma):=C_s(u)$. When $u_i=0$ for some $i\in[d]$, $\lim_{\sigma\to 0}\rho(\frac{u}{\sigma},\sigma)=0$.

4. When everything else except $|u_i|$ is fixed, $\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)$ is lower bounded by a positive function $L_m\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)$, which monotonically decreases to 0 as $|u_i|$ increases.

5. When everything else except $\sigma$ is fixed, $\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)$ is lower bounded by a positive function $L_s\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)$, which satisfies the following conditions: (a) there exists $\zeta_s>0$ such that $\sigma^{-1}L_s\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)$ is an increasing function of $\sigma$ when $\sigma\in(0,\zeta_s)$; (b) there exists $\zeta_s>0$ such that $L_s\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)$ is a decreasing function of $\sigma$ when $\sigma\in(\zeta_s,+\infty)$.

Proof: (1) From the Cauchy–Schwarz inequality, we have
$$\mathbb{E}_{z_i\sim\mathcal{N}(u_i,1)}[\phi'(\sigma\cdot z_i)]\le\sqrt{\mathbb{E}_{z_i\sim\mathcal{N}(u_i,1)}[\phi'^2(\sigma\cdot z_i)]} \qquad (18)$$
$$\mathbb{E}_{z_i\sim\mathcal{N}(u_i,1)}[\phi'(\sigma\cdot z_i)z_i\cdot z_i]\le\sqrt{\mathbb{E}_{z_i\sim\mathcal{N}(u_i,1)}[\phi'^2(\sigma\cdot z_i)z_i^2]\cdot\mathbb{E}_{z_i\sim\mathcal{N}(u_i,1)}[z_i^2]}=\sqrt{\mathbb{E}_{z_i\sim\mathcal{N}(u_i,1)}[\phi'^2(\sigma\cdot z_i)z_i^2]\cdot(u_i^2+1)} \qquad (19)$$
Equality in (18) and (19) holds if and only if $\phi'$ is a constant function. Since $\phi$ is the sigmoid function, the equalities in (18) and (19) cannot hold. By the definition of $\rho(u,\sigma)$ in Definition 3, we have $\beta_0(i,u,\sigma)-\alpha_0^2(i,u,\sigma)>0$ and $\beta_2(i,u,\sigma)-\frac{\alpha_2^2(i,u,\sigma)}{u_i^2+1}>0$. Therefore, $\rho(u,\sigma)>0$.

(2) Taking the limit inside the Gaussian densities,
$$\lim_{u_i\to 0}(u_j^2+1)\big(\beta_0(i,u,\sigma)-\alpha_0^2(i,u,\sigma)\big)=(u_j^2+1)\Big[\int_{-\infty}^{\infty}\phi'^2(\sigma z_i)\tfrac{1}{\sqrt{2\pi}}e^{-\frac{z_i^2}{2}}dz_i-\Big(\int_{-\infty}^{\infty}\phi'(\sigma z_i)\tfrac{1}{\sqrt{2\pi}}e^{-\frac{z_i^2}{2}}dz_i\Big)^2\Big] \qquad (21)$$
$$\lim_{u_i\to 0}\Big(\beta_2(i,u,\sigma)-\frac{\alpha_2^2(i,u,\sigma)}{u_i^2+1}\Big)=\int_{-\infty}^{\infty}\phi'^2(\sigma z_i)z_i^2\tfrac{1}{\sqrt{2\pi}}e^{-\frac{z_i^2}{2}}dz_i-\Big(\int_{-\infty}^{\infty}\phi'(\sigma z_i)z_i^2\tfrac{1}{\sqrt{2\pi}}e^{-\frac{z_i^2}{2}}dz_i\Big)^2 \qquad (22)$$
Combining (21) and (22), we can derive that $\rho(u,\sigma)$ converges to a positive function of $\sigma$ as $u_i$ goes to 0, i.e.,
$\lim_{u\to 0}\rho(u,\sigma):=C_m(\sigma)$.

(3) When all $u_i\neq 0$ ($i\in[d]$), with the substitution $z_i=\frac{u_i}{\sigma}x_i$,
$$\begin{aligned}
&\lim_{\sigma\to 0}\Big(\beta_2\big(i,\tfrac{u}{\sigma},\sigma\big)-\frac{\alpha_2^2\big(i,\tfrac{u}{\sigma},\sigma\big)}{u_i^2/\sigma^2+1}\Big)\\
&=\lim_{\sigma\to 0}\int_{-\infty}^{\infty}\phi'^2(\sigma z_i)z_i^2\tfrac{1}{\sqrt{2\pi}}e^{-\frac{(z_i-u_i/\sigma)^2}{2}}dz_i-\frac{1}{u_i^2/\sigma^2+1}\Big(\int_{-\infty}^{\infty}\phi'(\sigma z_i)z_i^2\tfrac{1}{\sqrt{2\pi}}e^{-\frac{(z_i-u_i/\sigma)^2}{2}}dz_i\Big)^2\\
&=\lim_{\sigma\to 0}\int_{-\infty}^{\infty}\phi'^2(u_ix_i)\tfrac{u_i^2}{\sigma^2}x_i^2\Big(2\pi\tfrac{\sigma^2}{u_i^2}\Big)^{-\frac12}e^{-\frac{(x_i-1)^2}{2\sigma^2/u_i^2}}dx_i-\frac{1}{u_i^2/\sigma^2+1}\Big(\int_{-\infty}^{\infty}\phi'(u_ix_i)\tfrac{u_i^2}{\sigma^2}x_i^2\Big(2\pi\tfrac{\sigma^2}{u_i^2}\Big)^{-\frac12}e^{-\frac{(x_i-1)^2}{2\sigma^2/u_i^2}}dx_i\Big)^2\\
&=\lim_{\sigma\to 0}\phi'^2(u_i)\tfrac{u_i^2}{\sigma^2}-\frac{1}{u_i^2/\sigma^2+1}\Big(\phi'(u_i)\tfrac{u_i^2}{\sigma^2}\Big)^2=\lim_{\sigma\to 0}\phi'^2(u_i)\tfrac{u_i^2}{\sigma^2}\Big(1-\frac{u_i^2/\sigma^2}{1+u_i^2/\sigma^2}\Big)=\lim_{\sigma\to 0}\frac{\phi'^2(u_i)}{1+\sigma^2/u_i^2}=\phi'^2(u_i) \qquad (23)
\end{aligned}$$
The third step of (23) uses the fact that the Gaussian distribution converges to a Dirac delta function as $\sigma$ goes to 0, so the integrals take their values at $x_i=1$. Similarly, we can obtain
$$\lim_{\sigma\to 0}\big(\beta_0(i,\tfrac{u}{\sigma},\sigma)-\alpha_0^2(i,\tfrac{u}{\sigma},\sigma)\big)=\phi'^2(u_i)-\phi'^2(u_i)=0 \qquad (24)$$
and, with the substitution $x_i=\sigma z_i$,
$$\begin{aligned}
&\lim_{\sigma\to 0}\frac{\partial}{\partial\sigma}\big(\beta_0(i,\tfrac{u}{\sigma},\sigma)-\alpha_0^2(i,\tfrac{u}{\sigma},\sigma)\big)\\
&=\lim_{\sigma\to 0}\frac{\partial}{\partial\sigma}\Big[\int_{-\infty}^{\infty}\phi'^2(x_i)(2\pi\sigma^2)^{-\frac12}e^{-\frac{(x_i-u_i)^2}{2\sigma^2}}dx_i-\Big(\int_{-\infty}^{\infty}\phi'(x_i)(2\pi\sigma^2)^{-\frac12}e^{-\frac{(x_i-u_i)^2}{2\sigma^2}}dx_i\Big)^2\Big]\\
&=\lim_{\sigma\to 0}\int_{-\infty}^{\infty}\phi'^2(x_i)(2\pi\sigma^2)^{-\frac12}e^{-\frac{(x_i-u_i)^2}{2\sigma^2}}\Big(-\sigma^{-1}+\frac{(x_i-u_i)^2}{\sigma^3}\Big)dx_i\\
&\qquad-2\int_{-\infty}^{\infty}\phi'(x_i)(2\pi\sigma^2)^{-\frac12}e^{-\frac{(x_i-u_i)^2}{2\sigma^2}}dx_i\cdot\int_{-\infty}^{\infty}\phi'(x_i)(2\pi\sigma^2)^{-\frac12}e^{-\frac{(x_i-u_i)^2}{2\sigma^2}}\Big(-\sigma^{-1}+\frac{(x_i-u_i)^2}{\sigma^3}\Big)dx_i\\
&=\lim_{\sigma\to 0}\frac{\phi'^2(u_i)}{\sigma}=+\infty \qquad (25)
\end{aligned}$$
Therefore, by L'Hôpital's rule together with (24) and (25), we have
$$\lim_{\sigma\to 0}\Big(\frac{u_j^2}{\sigma^2}+1\Big)\big(\beta_0(i,\tfrac{u}{\sigma},\sigma)-\alpha_0^2(i,\tfrac{u}{\sigma},\sigma)\big)=\lim_{\sigma\to 0}\frac{u_j^2}{2\sigma}\frac{\partial}{\partial\sigma}\big(\beta_0(i,\tfrac{u}{\sigma},\sigma)-\alpha_0^2(i,\tfrac{u}{\sigma},\sigma)\big)=+\infty \qquad (26)$$
Combining (26) and (23), we can derive that $\rho(\frac{u}{\sigma},\sigma)$ converges to a positive function of $u$ as $\sigma$ goes to 0, i.e., $\lim_{\sigma\to 0}\rho(\frac{u}{\sigma},\sigma):=C_s(u)$.
When $u_i=0$ for some $i\in[d]$, $\lim_{\sigma\to 0}\big(\frac{u_i^2}{\sigma^2}+1\big)\big(\beta_0(j,\frac{u}{\sigma},\sigma)-\alpha_0^2(j,\frac{u}{\sigma},\sigma)\big)=0$ by (24). Then, from Definition 3, we have $\lim_{\sigma\to 0}\rho(\frac{u}{\sigma},\sigma)=0$.

(4) We show the statement by contradiction. Suppose that for any positive function $h(u_i)$ that monotonically decreases to 0 as $|u_i|$ increases, there exists a $u_i^*\in\mathbb{R}$ such that $h(u_i)\ge\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)\big|_{u_i=u_i^*}$. Then we can derive that $\lim_{u_i\to u_i^*}\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)=0$. Since $\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)$ is continuous, we obtain $\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)\big|_{u_i=u_i^*}=0$, which contradicts the conclusion in Property 2.1.

(5) Condition (b) can be proved in the same way as (4). Therefore, we only need to show condition (a). When $(W^{*\top}u)_i\neq 0$ for all $i\in[K]$, $\lim_{\sigma\to 0}\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)=C_s(W^{*\top}u)>0$. Therefore, there exists $\zeta_s>0$ such that when $0<\sigma<\zeta_s$, $\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)>\frac{C_s(W^{*\top}u)}{2}$. Then we can define $L_s\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big):=\frac{C_s(W^{*\top}u)}{2\zeta_s^2}\sigma^2$, so that $\sigma^{-1}L_s\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)$ is an increasing function of $\sigma$ that stays below $\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)$.

When $(W^{*\top}u)_i=0$ for some $i\in[K]$, then $\lim_{\sigma\to 0}\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)=0$. We can derive
$$\lim_{\sigma\to 0}\frac{\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)}{\sigma}=\lim_{\sigma\to 0}\frac{\partial}{\partial\sigma}\rho\Big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\Big)\ge 0 \qquad (27)$$
The last step of (27) holds because if the limit were negative, then $\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)$ would be negative in a small neighborhood of $\sigma=0$, which contradicts the fact that $\rho\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)>0$. If the limit in (27) is 0, then $\lim_{\sigma\to 0}\frac{\partial}{\partial\sigma}\frac{\rho(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*))}{\sigma}>0$; otherwise there would be a small neighborhood of $\sigma=0$ in which $\frac{\rho(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*))}{\sigma}<0$. In this case we only need to let $\sigma^{-1}L_s\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big):=\frac{\rho(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*))}{\sigma}$. If the limit in (27) is positive, we can find a positive lower bound of $\frac{\rho(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*))}{\sigma}$ in a small neighborhood of $\sigma=0$, and an increasing function of $\sigma$, $\sigma^{-1}L_s\big(\frac{W^{*\top}u}{\sigma\delta_K(W^*)},\sigma\delta_K(W^*)\big)$, can be defined to lie below this lower bound. In conclusion, condition (a) is proved.

Property 3. With the notation in (6), if a function $f(x)$ is an even function, then
$$\mathbb{E}_{x\sim\mathcal{N}(\mu,\sigma^2I_d)}[f(x)]=\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,\sigma^2I_d)+\frac12\mathcal{N}(-\mu,\sigma^2I_d)}[f(x)]$$
Proof: Denote
$$g(x)=f(x)(2\pi\sigma^2)^{-\frac{d}{2}}\exp\Big(-\frac{\|x-\mu\|^2}{2\sigma^2}\Big) \qquad (29)$$
Applying the change of variables $x_k\to -x_k$ for every coordinate $k$ and using that $f$ is even,
$$\mathbb{E}_{x\sim\mathcal{N}(\mu,\sigma^2I_d)}[f(x)]=\int_{x\in\mathbb{R}^d}g(x)dx=\int_{x\in\mathbb{R}^d}g(-x)dx=\int_{x\in\mathbb{R}^d}f(x)(2\pi\sigma^2)^{-\frac{d}{2}}\exp\Big(-\frac{\|x+\mu\|^2}{2\sigma^2}\Big)dx=\mathbb{E}_{x\sim\mathcal{N}(-\mu,\sigma^2I_d)}[f(x)] \qquad (30)$$
Therefore, we have
$$\mathbb{E}_{x\sim\mathcal{N}(\mu,\sigma^2I_d)}[f(x)]=\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,\sigma^2I_d)+\frac12\mathcal{N}(-\mu,\sigma^2I_d)}[f(x)]$$

Property 4. Under the Gaussian mixture model $x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)$, we have the following upper bound:
$$\mathbb{E}_{x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}[(u^\top x)^{2t}]\le(2t-1)!!\,\|u\|^{2t}\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^{2t}$$
Proof: The main idea is to first establish an upper bound under a symmetric distribution assumption, and then apply Property 3 to extend the conclusion to the general case.

(a) If the mixed-Gaussian distribution is symmetric and $L=2$, i.e.,
$x\sim\frac12\mathcal{N}(\mu,\sigma^2I_d)+\frac12\mathcal{N}(-\mu,\sigma^2I_d)$, then we first analyze the distribution of $u^\top x$ by computing the moment generating function
$$\begin{aligned}
\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,\sigma^2I_d)+\frac12\mathcal{N}(-\mu,\sigma^2I_d)}\big[\exp(tu^\top x)\big]&=\mathbb{E}\Big[\exp\Big(t\sum_{i=1}^{d}u_ix_i\Big)\Big]=\prod_{i=1}^{d}\mathbb{E}[\exp(tu_ix_i)]\\
&=\prod_{i=1}^{d}\Big\{\sum_{j=1}^{2}\frac12\int_{-\infty}^{\infty}\exp(tu_ix_i)\frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{(x_i-(-1)^j\mu_i)^2}{2\sigma^2}\Big)dx_i\Big\}\\
&=\prod_{i=1}^{d}\Big\{\frac12\exp\Big(-tu_i\mu_i+\frac12\sigma^2u_i^2t^2\Big)+\frac12\exp\Big(tu_i\mu_i+\frac12\sigma^2u_i^2t^2\Big)\Big\}\\
&:=\sum_{i=1}^{2^d}\frac{1}{2^d}\exp\Big(t\mu_i'+\frac12t^2\sigma'^2\Big), \qquad (33)
\end{aligned}$$
which is the moment generating function of $\sum_{i=1}^{2^d}\frac{1}{2^d}\mathcal{N}(\mu_i',\sigma'^2)$. The last step of (33) follows by expanding the product of $d$ terms. Specifically, let $\{s^i\}_{i=1}^{2^d}$ denote all $2^d$ vectors in $\mathbb{R}^d$ taking values in $\{0,1\}$, and let $s^i_k$ ($k\in[d]$) denote the $k$-th entry of $s^i$. We define $\mu_i'=\sum_{k=1}^{d}(-1)^{s^i_k}u_k\mu_k\in\mathbb{R}$ for $i\in[2^d]$ and $\sigma'=\sigma\|u\|\in\mathbb{R}$, where $u_k$ and $\mu_k$ are the $k$-th entries of the vectors $u$ and $\mu$, respectively. Then we can derive the first few steps of $\mathbb{E}[(u^\top x)^{2t}]$:
$$\begin{aligned}
\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,\sigma^2I_d)+\frac12\mathcal{N}(-\mu,\sigma^2I_d)}[(u^\top x)^{2t}]&=\int_{-\infty}^{\infty}y^{2t}\sum_{i=1}^{2^d}\frac{1}{2^d}\frac{1}{\sqrt{2\pi}\sigma'}e^{-\frac{(y-\mu_i')^2}{2\sigma'^2}}dy\\
&=\sum_{i=1}^{2^d}\frac{1}{2^d}\int_{-\infty}^{\infty}(y-\mu_i'+\mu_i')^{2t}\frac{1}{\sqrt{2\pi}\sigma'}e^{-\frac{(y-\mu_i')^2}{2\sigma'^2}}dy\\
&=\sum_{i=1}^{2^d}\frac{1}{2^d}\int_{-\infty}^{\infty}\sum_{p=0}^{2t}\binom{2t}{p}\mu_i'^{2t-p}(y-\mu_i')^p\frac{1}{\sqrt{2\pi}\sigma'}e^{-\frac{(y-\mu_i')^2}{2\sigma'^2}}dy\\
&=\sum_{i=1}^{2^d}\frac{1}{2^d}\sum_{p=0}^{2t}\binom{2t}{p}\mu_i'^{2t-p}\cdot\begin{cases}0,&p\text{ odd}\\(p-1)!!\,\sigma'^p,&p\text{ even}\end{cases}\\
&=\sum_{i=1}^{2^d}\frac{1}{2^d}\sum_{k=0}^{t}\binom{2t}{2k}\mu_i'^{2t-2k}\sigma'^{2k}(2k-1)!!\\
&=\frac{1}{2^d}\sum_{k=0}^{t}\binom{2t}{2k}\sigma'^{2k}(2k-1)!!\sum_{i=1}^{2^d}\mu_i'^{2t-2k} \qquad (34)
\end{aligned}$$
The first step uses the distribution of $u^\top x$ obtained from (33). The third step follows from the binomial expansion. The fourth step results from the higher-order moments of the Gaussian distribution. The second-to-last step keeps only the even powers, and the last step exchanges the order of summation.
To compute the inner summation in the last step of (34), we have
$$\begin{aligned}
\sum_{i=1}^{2^d}\mu_i'^{2t}&=\sum_{i=1}^{2^d}\big(u_1(-1)^{s^i_1}\mu_1+u_2(-1)^{s^i_2}\mu_2+\cdots+u_d(-1)^{s^i_d}\mu_d\big)^{2t}\\
&=\sum_{i=1}^{2^d}\sum_{p_1^{(i)}+\cdots+p_d^{(i)}=2t}\frac{(2t)!}{p_1^{(i)}!p_2^{(i)}!\cdots p_d^{(i)}!}\big(u_1(-1)^{s^i_1}\mu_1\big)^{p_1^{(i)}}\cdots\big(u_d(-1)^{s^i_d}\mu_d\big)^{p_d^{(i)}}\\
&=\sum_{i=1}^{2^d}\sum_{p_1^{(i)}+\cdots+p_d^{(i)}=2t}\frac{(2t)!}{p_1^{(i)}!p_2^{(i)}!\cdots p_d^{(i)}!}(u_1\mu_1)^{p_1^{(i)}}\cdots(u_d\mu_d)^{p_d^{(i)}}\quad\text{(all the $p_j^{(i)}$ are even)}\\
&=\sum_{i=1}^{2^d}\sum_{p_1^{(i)}+\cdots+p_d^{(i)}=2t}\frac{(2t)!}{p_1^{(i)}!p_2^{(i)}!\cdots p_d^{(i)}!}(u_1^2\mu_1^2)^{q_1^{(i)}}\cdots(u_d^2\mu_d^2)^{q_d^{(i)}}\quad\Big(q_j^{(i)}=\frac{p_j^{(i)}}{2}\Big)\\
&\le\sum_{i=1}^{2^d}\max\Big\{\frac{(2t)!}{p_1^{(i)}!p_2^{(i)}!\cdots p_d^{(i)}!}\Big/\frac{t!}{q_1^{(i)}!q_2^{(i)}!\cdots q_d^{(i)}!}\Big\}\sum_{\sum_{h}q_h^{(i)}=t}\frac{t!}{q_1^{(i)}!q_2^{(i)}!\cdots q_d^{(i)}!}(u_1^2\mu_1^2)^{q_1^{(i)}}\cdots(u_d^2\mu_d^2)^{q_d^{(i)}}\\
&\le\sum_{i=1}^{2^d}\max\Big\{\frac{(2t)!}{p_1^{(i)}!p_2^{(i)}!\cdots p_d^{(i)}!}\Big/\frac{t!}{q_1^{(i)}!q_2^{(i)}!\cdots q_d^{(i)}!}\Big\}\cdot(u_1^2\mu_1^2+\cdots+u_d^2\mu_d^2)^t\\
&\le\sum_{i=1}^{2^d}\max\Big\{\frac{(2t)!}{p_1^{(i)}!p_2^{(i)}!\cdots p_d^{(i)}!}\Big/\frac{t!}{q_1^{(i)}!q_2^{(i)}!\cdots q_d^{(i)}!}\Big\}\cdot(u_1^2+\cdots+u_d^2)^t\cdot\max_j\{|\mu_j|\}^{2t}\\
&\le 2^d\,\|u\|^{2t}\cdot\max_j\{|\mu_j|\}^{2t}\cdot(2t-1)!! \qquad (35)
\end{aligned}$$
We first explain the third step. For any odd $p_\ell$, the expansion of $\mu_i'^{2t}$ contains a term $a_0=(u_\ell(-1)^{s^i_\ell}\mu_\ell)^{p_\ell}\prod_{k\neq\ell}(u_k(-1)^{s^i_k}\mu_k)^{p_k}$, whose corresponding sign vector is $(s^i_1,\cdots,s^i_\ell,\cdots,s^i_d)$. We can find a $\mu_j'$ whose sign vector is $(s^i_1,\cdots,1-s^i_\ell,\cdots,s^i_d)$, differing from that of $\mu_i'$ only in the $\ell$-th entry. Therefore, the expansion of $\mu_j'^{2t}$ contains a term $a_0'=(u_\ell(-1)^{1-s^i_\ell}\mu_\ell)^{p_\ell}\prod_{k\neq\ell}(u_k(-1)^{s^i_k}\mu_k)^{p_k}$ that cancels $a_0$. Therefore, no odd-power terms remain. The third-to-last step of (35) is by the inverse multinomial expansion. The second-to-last step uses the inequality $\sum_{i=1}^{N}a_ib_i\le\max_i\{b_i\}\cdot\sum_{i=1}^{N}a_i$ for positive $a_i$ and $b_i$. The last step holds because
$$\frac{(2t)!}{p_1^{(i)}!p_2^{(i)}!\cdots p_d^{(i)}!}\Big/\frac{t!}{q_1^{(i)}!q_2^{(i)}!\cdots q_d^{(i)}!}=\frac{(2t)!}{t!}\cdot\frac{\frac{p^{(i)}_{d_1}}{2}!\frac{p^{(i)}_{d_2}}{2}!\cdots\frac{p^{(i)}_{d_m}}{2}!}{p^{(i)}_{d_1}!p^{(i)}_{d_2}!\cdots p^{(i)}_{d_m}!}\le\frac{(2t)!}{t!}\cdot\Big(\frac12\Big)^t=(2t-1)!! \qquad (36)$$
In the first equality of (36), $p_{d_1},\ldots,p_{d_m}$ denote all the positive $p_j^{(i)}$, so that $\sum_{i=1}^{m}p_{d_i}=2t$ with each $p_{d_i}\ge 2$; since $\frac{(p/2)!}{p!}\le(\frac12)^{p/2}$ for even $p\ge 2$, the product of the $m$ factors is at most $(\frac12)^{\sum_j q_j^{(i)}}=(\frac12)^t$, which is used in the inequality. Therefore, combining (35), we can continue the derivation of (34) as follows:
$$\begin{aligned}
\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,\sigma^2I_d)+\frac12\mathcal{N}(-\mu,\sigma^2I_d)}[(u^\top x)^{2t}]&=\frac{1}{2^d}\sum_{k=0}^{t}\binom{2t}{2k}\sigma'^{2k}(2k-1)!!\sum_{i=1}^{2^d}\mu_i'^{2t-2k}\\
&\le\sum_{k=0}^{t}\binom{2t}{2k}\sigma^{2k}(2k-1)!!\,\|u\|^{2t}\max_j\{|\mu_j|\}^{2t-2k}(2t-1-2k)!!\\
&\le(2t-1)!!\,\|u\|^{2t}(\|\mu\|_\infty+\sigma)^{2t} \qquad (37)
\end{aligned}$$
where we used $\sigma'^{2k}=\sigma^{2k}\|u\|^{2k}$. The last step holds because
$$(2t-1-2k)!!(2k-1)!!\le(2t-1-2k)!!\underbrace{(2t-1)(2t-3)\cdots(2t-2k+1)}_{k\ \text{terms}}=(2t-1)!!$$
(b) From Property 3, since $(u^\top x)^{2t}$ is an even function, we have the same result for a general Gaussian distribution:
$$\mathbb{E}_{x\sim\mathcal{N}(\mu,\sigma^2I_d)}[(u^\top x)^{2t}]=\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,\sigma^2I_d)+\frac12\mathcal{N}(-\mu,\sigma^2I_d)}[(u^\top x)^{2t}]\le(2t-1)!!\,\|u\|^{2t}(\|\mu\|_\infty+\sigma)^{2t} \qquad (38)$$
Therefore, if there are $L$ components in the Gaussian mixture model, then
$$\mathbb{E}_{x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}[(u^\top x)^{2t}]\le(2t-1)!!\,\|u\|^{2t}\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^{2t}$$

Property 5. With the Gaussian mixture model (7), we have
$$\mathbb{E}_{x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}[\|x\|^{2t}]\le d^t(2t-1)!!\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^{2t}$$
Proof:
$$\begin{aligned}
\mathbb{E}_{x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}[\|x\|_2^{2t}]&=\mathbb{E}\Big[\Big(\sum_{i=1}^{d}x_i^2\Big)^t\Big]=\mathbb{E}\Big[d^t\Big(\frac{\sum_{i=1}^{d}x_i^2}{d}\Big)^t\Big]\le\mathbb{E}\Big[d^t\frac{\sum_{i=1}^{d}x_i^{2t}}{d}\Big]\\
&=d^{t-1}\sum_{i=1}^{d}\sum_{j=1}^{L}\int_{-\infty}^{\infty}(x_i-\mu_{ji}+\mu_{ji})^{2t}\lambda_j\frac{1}{\sqrt{2\pi}\sigma_j}\exp\Big(-\frac{(x_i-\mu_{ji})^2}{2\sigma_j^2}\Big)dx_i\\
&=d^{t-1}\sum_{i=1}^{d}\sum_{j=1}^{L}\sum_{k=0}^{2t}\binom{2t}{k}\lambda_j|\mu_{ji}|^{2t-k}\cdot\begin{cases}0,&k\text{ odd}\\(k-1)!!\,\sigma_j^k,&k\text{ even}\end{cases}\\
&\le d^{t-1}\sum_{i=1}^{d}\sum_{j=1}^{L}\sum_{k=0}^{2t}\binom{2t}{k}\lambda_j|\mu_{ji}|^{2t-k}\sigma_j^k\cdot(2t-1)!!\\
&=d^{t-1}\sum_{i=1}^{d}\sum_{j=1}^{L}\lambda_j(|\mu_{ji}|+\sigma_j)^{2t}(2t-1)!!\le d^t(2t-1)!!\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^{2t}
\end{aligned}$$
In the third step, we apply Jensen's inequality because $f(x)=x^t$ is convex for $x\ge 0$ and $t\ge 1$.
In the fourth step, we apply the binomial theorem and the $k$-th order central moments of the Gaussian distribution.

Property 6. The population risk function $f(W)$ is defined as
$$f(W)=\mathbb{E}_{x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}[f_n(W)]=\mathbb{E}_{x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\frac1n\sum_{i=1}^{n}\ell(W;x_i,y_i)\Big]=\mathbb{E}_{x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}[\ell(W;x,y)]$$
Based on (2), (3) and (4), we can derive its gradient and Hessian as follows:
$$\frac{\partial\ell(W;x,y)}{\partial w_j}=-\frac1K\cdot\frac{y-H(W)}{H(W)(1-H(W))}\,\phi'(w_j^\top x)\,x=\zeta(W)\cdot x \qquad (43)$$
$$\frac{\partial^2\ell(W;x,y)}{\partial w_j\partial w_l}=\xi_{j,l}\cdot xx^\top \qquad (44)$$
$$\xi_{j,l}(W)=\begin{cases}\dfrac{1}{K^2}\phi'(w_j^\top x)\phi'(w_l^\top x)\dfrac{H(W)^2+y-2y\cdot H(W)}{H^2(W)(1-H(W))^2},&j\neq l\\[8pt]\dfrac{1}{K^2}\phi'(w_j^\top x)\phi'(w_l^\top x)\dfrac{H(W)^2+y-2y\cdot H(W)}{H^2(W)(1-H(W))^2}-\dfrac1K\phi''(w_j^\top x)\dfrac{y-H(W)}{H(W)(1-H(W))},&j=l\end{cases} \qquad (45)$$

Property 7. With $D_m(\lambda,M,\sigma)$ defined in Definition 5, we have

(i) $D_m(\lambda,M,\sigma)D_{2m}(\lambda,M,\sigma)\le D_{3m}(\lambda,M,\sigma)$ \qquad (46)

(ii) $D_m(\lambda,M,\sigma)^2\le D_{2m}(\lambda,M,\sigma)$ \qquad (47)

Proof: To prove (46), we first compare the terms $\sum_{i=1}^{L}\lambda_ia_i\sum_{i=1}^{L}\lambda_ia_i^2$ and $\sum_{i=1}^{L}\lambda_ia_i^3$, where $a_i\ge 1$, $i\in[L]$, and $\sum_{i=1}^{L}\lambda_i=1$:
$$\begin{aligned}
\sum_{i=1}^{L}\lambda_ia_i^3-\sum_{i=1}^{L}\lambda_ia_i\sum_{i=1}^{L}\lambda_ia_i^2&=\sum_{i=1}^{L}\lambda_ia_i\Big(a_i^2-\sum_{j=1}^{L}\lambda_ja_j^2\Big)=\sum_{i=1}^{L}\lambda_ia_i\Big((1-\lambda_i)a_i^2-\sum_{j\neq i}\lambda_ja_j^2\Big)\\
&=\sum_{i=1}^{L}\lambda_ia_i\Big(\sum_{j\neq i}\lambda_ja_i^2-\sum_{j\neq i}\lambda_ja_j^2\Big)=\sum_{i=1}^{L}\lambda_ia_i\sum_{j\neq i}\lambda_j(a_i^2-a_j^2)\\
&=\sum_{1\le i<j\le L}\lambda_i\lambda_j\big[a_i(a_i^2-a_j^2)+a_j(a_j^2-a_i^2)\big]=\sum_{1\le i<j\le L}\lambda_i\lambda_j(a_i-a_j)^2(a_i+a_j)\ge 0 \qquad (48)
\end{aligned}$$
The second-to-last step holds because the pairwise terms $\lambda_ia_i\cdot\lambda_j(a_i^2-a_j^2)$ and $\lambda_ja_j\cdot\lambda_i(a_j^2-a_i^2)$ in the summation can be grouped together. From (48), we obtain
$$\sum_{i=1}^{L}\lambda_ia_i\sum_{i=1}^{L}\lambda_ia_i^2\le\sum_{i=1}^{L}\lambda_ia_i^3 \qquad (49)$$
Combining (49) and the definition of $D_m(\lambda,M,\sigma)$ in Definition 5, we can derive (46). Similarly, to prove (47), we compare the terms $(\sum_{i=1}^{L}\lambda_ia_i)^2$ and $\sum_{i=1}^{L}\lambda_ia_i^2$, where $a_i\ge 1$, $i\in[L]$, and $\sum_{i=1}^{L}\lambda_i=1$:
$$\begin{aligned}
\sum_{i=1}^{L}\lambda_ia_i^2-\Big(\sum_{i=1}^{L}\lambda_ia_i\Big)^2&=\sum_{i=1}^{L}\lambda_ia_i\Big(a_i-\sum_{j=1}^{L}\lambda_ja_j\Big)=\sum_{i=1}^{L}\lambda_ia_i\Big((1-\lambda_i)a_i-\sum_{j\neq i}\lambda_ja_j\Big)\\
&=\sum_{i=1}^{L}\lambda_ia_i\sum_{j\neq i}\lambda_j(a_i-a_j)=\sum_{1\le i<j\le L}\lambda_i\lambda_j\big[a_i(a_i-a_j)+a_j(a_j-a_i)\big]\\
&=\sum_{1\le i<j\le L}\lambda_i\lambda_j(a_i-a_j)^2\ge 0 \qquad (50)
\end{aligned}$$
The derivation of (50) parallels that of (48). By (50), we have
$$\Big(\sum_{i=1}^{L}\lambda_ia_i\Big)^2\le\sum_{i=1}^{L}\lambda_ia_i^2 \qquad (51)$$
Combining (51) and the definition of $D_m(\lambda,M,\sigma)$ in Definition 5, we can derive (47).
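Property 7 and the symmetrization identity of Property 3 are easy to check numerically. The sketch below (our own sanity check, not part of the proofs; the mixture parameters are randomly drawn and purely illustrative) verifies the two $D$-function inequalities and, in one dimension via quadrature, the even-function identity:

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative mixture parameters: weights, ||mu_l||_inf and sigma_l per component
L = 4
lam = rng.dirichlet(np.ones(L))
mu_inf = rng.uniform(0.1, 3.0, size=L)
sig = rng.uniform(0.2, 2.0, size=L)

def D(m):
    # D_m(lambda, M, sigma) from Definition 5
    return float(np.sum(lam * (mu_inf / sig + 1.0) ** m))

m = 4
assert D(m) >= 1.0                           # D is lower bounded by 1
assert D(m) * D(2 * m) <= D(3 * m) + 1e-12   # Property 7 (i)
assert D(m) ** 2 <= D(2 * m) + 1e-12         # Property 7 (ii)

# Property 3 in one dimension: for an even f, the expectation under N(mu, s^2)
# equals the expectation under the mixture (1/2)N(mu, s^2) + (1/2)N(-mu, s^2).
xs = np.linspace(-30.0, 30.0, 120001)
dx = xs[1] - xs[0]
pdf = lambda m_, s_: np.exp(-(xs - m_) ** 2 / (2 * s_ ** 2)) / (s_ * np.sqrt(2 * np.pi))
f = np.cos(xs) + xs ** 4                     # an even test function
mu0, s0 = 1.3, 0.7
e_single = np.sum(f * pdf(mu0, s0)) * dx
e_mix = np.sum(f * 0.5 * (pdf(mu0, s0) + pdf(-mu0, s0))) * dx
assert abs(e_single - e_mix) < 1e-6
```

The $D$-function inequalities hold for any weights summing to one and any $a_i\ge 1$, which is exactly the Chebyshev-sum structure exploited in (48) and (50).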

B PROOF OF THEOREM 1

Theorem 1 is built upon three lemmas. Lemma 1 shows that with $O(dK^5\log^2 d)$ samples, the empirical risk function is strongly convex in the neighborhood of $W^*$. Lemma 2 shows that if initialized in this convex region, the gradient descent algorithm converges linearly to a critical point $\widehat{W}_n$ that is close to $W^*$. Lemma 3 shows that the tensor initialization method in Subroutine 1 returns an initialization $W_0\in\mathbb{R}^{d\times K}$ in the local convex region. Theorem 1 follows naturally by combining these three lemmas. This proof approach builds upon that of (Fu et al., 2020). One of our major technical contributions is extending Lemmas 1 and 2 to the Gaussian mixture model, while the results in (Fu et al., 2020) only apply to the standard Gaussian model. The second major contribution is a new tensor initialization method for the Gaussian mixture model such that the initial point is in the convex region (see Lemma 3). Both contributions require the development of new tools, and our analyses are much more involved than those for the standard Gaussian due to the complexity introduced by the Gaussian mixture model.

To present these lemmas, the Euclidean ball $B(W^*,r)$ is used to denote the neighborhood of $W^*$, where $r$ is the radius of the ball:
$$B(W^*,r)=\{W\in\mathbb{R}^{d\times K}:\|W-W^*\|_F\le r\} \qquad (52)$$
The radius of the convex region is
$$r:=\Theta\Bigg(\frac{C_3\epsilon_0\sum_{l=1}^{L}\frac{\lambda_l\sigma_l^2}{\eta\kappa^2}\rho\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)}{K^{\frac72}\Big(\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^8\Big)^{\frac14}}\Bigg)$$
with some constant $C_3>0$.

Lemma 1 (Local strong convexity) Consider the classification model with the FCN (2) and the sigmoid activation function. There exists a constant $C_1>0$ such that, as long as the sample size satisfies
$$n\ge C_1\epsilon_0^{-2}\Big(\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\Big)^2\Big(\sum_{l=1}^{L}\frac{\lambda_l\sigma_l^2}{\eta\kappa^2}\rho\big(\tfrac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)\Big)^{-2}dK^5\log^2 d \qquad (54)$$
for some $\epsilon_0\in(0,\frac14)$, we have for all $W\in B(W^*,r_{FCN})$,
$$\Omega\Big(\frac{1-2\epsilon_0}{K^2}\sum_{l=1}^{L}\frac{\lambda_l\sigma_l^2}{\eta\kappa^2}\rho\big(\tfrac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)\Big)\cdot I_{dK}\preceq\nabla^2f_n(W)\preceq C_2\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\cdot I_{dK} \qquad (55)$$
with probability at least $1-d^{-10}$ for some constant $C_2>0$.

Lemma 2 (Linear convergence of gradient descent) Assume the conditions in Lemma 1 hold. If the local convexity holds, there exists a critical point $\widehat{W}_n\in B(W^*,r)$, for some constant $C_3>0$ and $\epsilon_0\in(0,\frac12)$, such that
$$\|\widehat{W}_n-W^*\|_F\le O\Bigg(\frac{K^{\frac52}\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}{\sum_{l=1}^{L}\frac{\lambda_l\sigma_l^2}{\eta\kappa^2}\rho\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)}\sqrt{\frac{d\log n}{n}}\Bigg) \qquad (56)$$
If the initial point $W_0\in B(W^*,r)$, the gradient descent algorithm converges linearly to $\widehat{W}_n$, i.e.,
$$\|W_t-\widehat{W}_n\|_F\le\Bigg(1-\Omega\Big(\frac{\sum_{l=1}^{L}\frac{\lambda_l\sigma_l^2}{\eta\kappa^2}\rho\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)}{K^2\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}\Big)\Bigg)^t\|W_0-\widehat{W}_n\|_F \qquad (57)$$
with probability at least $1-d^{-10}$.

Lemma 3 (Tensor initialization) For the classification model, with $D_6(\lambda,M,\sigma)$ defined in Definition 5, if the sample size satisfies
$$n\ge\kappa^8K^4\tau^{12}D_6(\lambda,M,\sigma)\cdot d\log^2 d, \qquad (58)$$
then the output $W_0\in\mathbb{R}^{d\times K}$ satisfies
$$\|W_0-W^*\|\lesssim\kappa^6K^3\tau^6\sqrt{D_6(\lambda,M,\sigma)\frac{d\log n}{n}}\,\|W^*\| \qquad (59)$$
with probability at least $1-n^{-\Omega(\delta_1^4)}$.
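The behavior described in Lemma 2 can be observed qualitatively in simulation. The sketch below is our own illustration, not the paper's Algorithm 1: all dimensions, mixture parameters, the perturbation size, and the learning rate are arbitrary choices, and plain full-batch gradient descent on the empirical cross-entropy risk is run from an initialization near the teacher $W^*$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 8, 3, 4000

# illustrative two-component Gaussian mixture input
lam = np.array([0.6, 0.4])
mus = np.stack([np.full(d, 0.5), np.full(d, -0.5)])
sigs = np.array([1.0, 0.8])
comp = rng.choice(2, size=n, p=lam)
X = mus[comp] + sigs[comp, None] * rng.standard_normal((n, d))

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
W_star = rng.standard_normal((d, K)) / np.sqrt(d)       # teacher weights
y = rng.binomial(1, sigmoid(X @ W_star).mean(axis=1))   # labels from the teacher

def risk_and_grad(W):
    A = sigmoid(X @ W)                                  # n x K hidden activations
    h = np.clip(A.mean(axis=1), 1e-9, 1 - 1e-9)         # H(W) per sample
    risk = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    coef = -(y - h) / (h * (1 - h)) / K                 # per-sample factor from (43)
    grad = X.T @ (coef[:, None] * A * (1 - A)) / n
    return risk, grad

W = W_star + 0.2 * rng.standard_normal((d, K)) / np.sqrt(d)  # init near W*
risks = []
for _ in range(500):
    r, g = risk_and_grad(W)
    risks.append(r)
    W -= 0.05 * g
assert risks[-1] < risks[0]   # the empirical risk decreases along the iterates
```

This only illustrates the qualitative picture (descent inside the locally convex region); the quantitative rate in (57) depends on the mixture parameters through the $\rho$- and $D$-functions.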

Proof of Theorem 1 and Corollary 1:

From Lemmas 2 and 3, we know that if $n$ is sufficiently large such that the initialization $W_0$ returned by the tensor method lies in the region $B(W^*,r)$, then the gradient descent method converges to a critical point $\widehat{W}_n$ that is sufficiently close to $W^*$. To achieve this, one sufficient condition is
$$\|W_0-W^*\|_F\le\sqrt{K}\|W_0-W^*\|\le\kappa^6K^{\frac72}\tau^6\sqrt{D_6(\lambda,M,\sigma)\frac{d\log n}{n}}\,\|W^*\|\le\frac{C_3\epsilon_0\,\Gamma(\lambda,M,\sigma,W^*)\sigma_{\max}^2}{K^{\frac72}\Big(\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^8\Big)^{\frac14}} \qquad (60)$$
where the first inequality follows from $\|W\|_F\le\sqrt{K}\|W\|$ for $W\in\mathbb{R}^{d\times K}$, the second inequality comes from Lemma 3, and the third inequality comes from the requirement to lie in the region $B(W^*,r)$. This is equivalent to the following condition:
$$n\ge C_0\epsilon_0^{-2}\tau^{12}\kappa^{12}K^{14}\Big(\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^8\Big)^{\frac12}\big(\delta_1(W^*)\big)^2D_6(\lambda,M,\sigma)\,\Gamma(\lambda,M,\sigma,W^*)^{-2}\sigma_{\max}^{-4}\cdot d\log^2 d \qquad (61)$$
where $C_0=\max\{C_4,C_3^{-2}\}$. By Definition 5, we can obtain
$$\Big(\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^8\Big)^{\frac12}\le\sqrt{D_4(\lambda,M,\sigma)D_8(\lambda,M,\sigma)}\,\sigma_{\max}^6 \qquad (62)$$
From Property 7, we have
$$\sqrt{D_4(\lambda,M,\sigma)D_8(\lambda,M,\sigma)}\,D_6(\lambda,M,\sigma)\le\sqrt{D_{12}(\lambda,M,\sigma)}\sqrt{D_{12}(\lambda,M,\sigma)}=D_{12}(\lambda,M,\sigma) \qquad (63)$$
Plugging (62) and (63) into (61), we have
$$n\ge C_0\epsilon_0^{-2}\kappa^{12}K^{14}\big(\sigma_{\max}\delta_1(W^*)\big)^2\tau^{12}\Gamma(\lambda,M,\sigma,W^*)^{-2}D_{12}(\lambda,M,\sigma)\cdot d\log^2 d \qquad (64)$$
Considering the requirements on the sample complexity in (54), (58) and (64), (64) gives a sufficient number of samples. Taking the union bound over all the failure probabilities in Lemmas 1 and 3, (64) holds with probability $1-d^{-10}$. By Properties 2.4 and 2.5, $\rho\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ can be lower bounded by the positive, monotonically decreasing function $L_m\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ when everything else except $|\mu_{l(i)}|$ is fixed, or by $L_s\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ when everything else except $\sigma_l$ is fixed.
Then, by substituting the corresponding lower bound of $\rho\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ into $\Gamma(\lambda,M,\sigma,W^*)$, we obtain an upper bound of $(\sigma_{\max}\delta_1(W^*))^2\tau^{12}\Gamma(\lambda,M,\sigma,W^*)^{-2}D_{12}(\lambda,M,\sigma)$, denoted by $B(\lambda,M,\sigma,W^*)$. To be more specific, when everything else except $|\mu_{l(i)}|$ is fixed, $L_m\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ is plugged into $B(\lambda,M,\sigma,W^*)$. Since $D_{12}(\lambda,M,\sigma)$ and $L_m\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)^{-2}$ are both increasing functions of $|\mu_{l(i)}|$, $B(\lambda,M,\sigma,W^*)$ is an increasing function of $|\mu_{l(i)}|$. When everything else except $\sigma_l$ is fixed, if $\sigma_l=\sigma_{\max}>\zeta_s$, then $\sigma_{\max}^2\tau^{12}D_{12}(\lambda,M,\sigma)$ is an increasing function of $\sigma_l$. Since $L_s\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ is a decreasing function of $\sigma_l$, $L_s\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)^{-2}$ is an increasing function of $\sigma_l$. Hence, $B(\lambda,M,\sigma,W^*)$ is an increasing function of $\sigma_l$. Moreover, when all $\sigma_l<\zeta_s$ and go to 0, two decreasing functions of $\sigma_l$, namely $\sigma_{\max}^2L_s\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)^{-2}$ and $D_{12}(\lambda,M,\sigma)$, become the dominant terms of $B(\lambda,M,\sigma,W^*)$. Therefore, $B(\lambda,M,\sigma,W^*)$ increases to infinity as all the $\sigma_l$'s go to 0. Hence, we have
$$n\ge\mathrm{poly}(\epsilon_0^{-1},\kappa,K)\,B(\lambda,M,\sigma,W^*)\cdot d\log^2 d \qquad (65)$$
Similarly, by replacing $\rho\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ with $L_m\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ when everything else except $|\mu_{l(i)}|$ is fixed, or with $L_s\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ (or $\sigma^{-2}L_s\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ for $\sigma\ge1$) when everything else except $\sigma_l$ is fixed, (57) can also be transformed into another feasible bound. We denote the modified version of the convergence rate as $v=1-K^{-2}q(\lambda,M,\sigma,W^*)$. Since $q(\lambda,M,\sigma,W^*)$ is a ratio between the smallest and the largest singular values of $\nabla^2f(W^*)$, we have $q(\lambda,M,\sigma,W^*)\in(0,1)$. Hence, $1-K^{-2}q(\lambda,M,\sigma,W^*)\in(0,1)$ because $K\ge1$.

When everything else except $|\mu_{l(i)}|$ is fixed, since $L_m\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ is monotonically decreasing and $\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2$ is increasing as $|\mu_{l(i)}|$ increases, $v$ is an increasing function of $|\mu_{l(i)}|$ that tends to 1. Similarly, when everything else except $\sigma_l$ is fixed, where $\sigma_l\ge\max\{1,\zeta_s\}$, $\frac{1}{\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}$ decreases to 0 as $\sigma_l$ increases. We replace $\rho\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ by $\sigma^{-2}L_s\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$, and then $\sigma^2\cdot\sigma^{-2}L_s\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)=L_s\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$ is a decreasing function lying below $\rho\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$. Therefore, $v$ is an increasing function of $\sigma_l$ that tends to 1 when $\sigma_l\ge\max\{1,\zeta_s\}$. When all $\sigma_l\le\zeta_s$ go to 0, all the $L_s\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)$'s and $\frac{\sigma_l^2}{\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}$'s decrease to 0. Therefore, $v$ increases to 1. The bound on $\|\widehat{W}_n-W^*\|_F$ follows directly from (56).

C PROOF OF LEMMA 1

We first state some important lemmas used in the proof in Section C.1 and describe the proof in Section C.2. The proofs of these lemmas are provided in Sections C.3 to C.7 in sequence. The proof idea mainly follows from (Fu et al., 2020). Lemma 6 shows that the Hessian $\nabla^2f(W)$ of the population risk function is smooth. Lemma 7 shows that $\nabla^2f(W)$ is strongly convex in the neighborhood of $W^*$. Lemma 8 shows that the Hessian of the empirical risk function $\nabla^2f_n(W)$ is close to its population counterpart $\nabla^2f(W)$ in the local convex region. Combining these three lemmas, we can derive the proof of Lemma 1. Lemma 4 is used in the proof of Lemma 7, and Lemma 5 is used in the proof of Lemma 8.

The analyses of the Hessian matrix of the population loss in (Fu et al., 2020) and (Zhong et al., 2017b) cannot be extended to the Gaussian mixture model. To solve this problem, we develop new tools using properties of symmetric distributions and even functions. Our approach can also be applied to other activations such as tanh or erf. Moreover, if we directly applied the existing matrix concentration inequalities of these works to bound the error between the empirical loss and the population loss, the resulting sample complexity would be $O(d^3)$ and could not reflect the influence of each component of the Gaussian mixture distribution. We develop a new version of Bernstein's inequality (see (137)) so that the final bound is $O(d\log^2 d)$. (Mei et al., 2018a) showed that the landscape of the empirical risk is close to that of the population risk when the number of samples is sufficiently large for the special case of $K=1$. Focusing on Gaussian mixture models, our result explicitly shows in Lemma 8 how the parameters of the input distribution, including the proportion, mean, and variance of each component, affect the error bound between the empirical loss and the population loss.
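The concentration phenomenon behind Lemma 8 — an empirical average of rank-one matrices drifting toward its population mean as $n$ grows — can be illustrated directly. The sketch below is a simplified stand-in of our own (the second-moment matrix $\mathbb{E}[xx^\top]$ replaces the actual Hessian, and the mixture parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 20
lam = np.array([0.5, 0.5])
mus = np.stack([np.full(d, 0.8), np.full(d, -0.8)])
sigs = np.array([1.0, 1.5])

# population second moment: E[x x^T] = sum_l lam_l (mu_l mu_l^T + sigma_l^2 I)
M_pop = sum(l * (np.outer(m, m) + s ** 2 * np.eye(d))
            for l, m, s in zip(lam, mus, sigs))

def spectral_error(n):
    comp = rng.choice(2, size=n, p=lam)
    X = mus[comp] + sigs[comp, None] * rng.standard_normal((n, d))
    return np.linalg.norm(X.T @ X / n - M_pop, 2)

e_small, e_large = spectral_error(200), spectral_error(20000)
assert e_large < e_small   # more samples, smaller spectral-norm error
```

The actual Hessian summands are $\xi_{j,l}\,x x^\top$ with bounded $\xi_{j,l}$ (see (45) and (93)), which is why this rank-one average captures the right qualitative behavior.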

C.1 USEFUL LEMMAS IN THE PROOF OF LEMMA 1

Lemma 4
$$\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,I_d)+\frac12\mathcal{N}(-\mu,I_d)}\Big[\Big(\sum_{i=1}^{k}p_i^\top x\cdot\phi'(\sigma\cdot x_i)\Big)^2\Big]\ge\rho(\mu,\sigma)\|P\|_F^2,$$
where $\rho(\mu,\sigma)$ is defined in Definition 3.

Lemma 5 With the FCN model (2) and the Gaussian mixture model (7), for some constant $C_{12}>0$, we have
$$\mathbb{E}_{x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\sup_{W\neq W'\in B(W^*,r)}\frac{\|\nabla^2\ell(W,x)-\nabla^2\ell(W',x)\|}{\|W-W'\|_F}\Big]\le C_{12}\,d^{\frac32}K^{\frac52}\sqrt{\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4} \qquad (67)$$

Lemma 6 (Hessian smoothness of the population loss) In the FCN model (2), assume $\|w_k^*\|_2\le 1$ for all $k$. Then for some constant $C_5>0$, we have
$$\|\nabla^2f(W)-\nabla^2f(W^*)\|\le C_5K^{\frac32}\Big(\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^8\Big)^{\frac14}\|W-W^*\|_F \qquad (68)$$

Lemma 7 (Local strong convexity of the population loss) In the FCN model (2), if $\|W-W^*\|_F\le r$ for an $\epsilon_0\in(0,\frac14)$, then for some constant $C_4>0$,
$$\frac{4(1-\epsilon_0)}{K^2}\sum_{l=1}^{L}\frac{\lambda_l\sigma_l^2}{\eta\kappa^2}\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big)\cdot I_{dK}\preceq\nabla^2f(W)\preceq C_4\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\cdot I_{dK} \qquad (69)$$

Lemma 8 In the FCN model (2), as long as $n\ge C\cdot dK\log dK$ for some constant $C>0$, we have
$$\sup_{W\in B(W^*,r_{FCN})}\|\nabla^2f_n(W)-\nabla^2f(W)\|\le C_6\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{dK\log n}{n}} \qquad (70)$$
with probability at least $1-d^{-10}$ for some constant $C_6>0$.

C.2 PROOF OF LEMMA 1

From Lemmas 7 and 8, with probability at least $1-d^{-10}$,
$$\begin{aligned}
\nabla^2f_n(W)&\succeq\nabla^2f(W)-\|\nabla^2f(W)-\nabla^2f_n(W)\|\cdot I\\
&\succeq\Omega\Big(\frac{1-\epsilon_0}{K^2}\sum_{l=1}^{L}\frac{\lambda_l\sigma_l^2}{\eta\kappa^2}\rho\big(\tfrac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)\Big)\cdot I-O\Big(C_6\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\tfrac{dK\log n}{n}}\Big)\cdot I \qquad (71)
\end{aligned}$$
As long as the sample size is set to satisfy
$$C_6\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{dK\log n}{n}}\le\frac{\epsilon_0}{K^2}\sum_{l=1}^{L}\frac{\lambda_l\sigma_l^2}{\eta\kappa^2}\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big), \qquad (72)$$
i.e.,
$$n\ge C_1\epsilon_0^{-2}\Big(\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\Big)^2\Big(\sum_{l=1}^{L}\frac{\lambda_l\sigma_l^2}{\eta\kappa^2}\rho\big(\tfrac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)\Big)^{-2}dK^5\log^2 d \qquad (73)$$
for some constant $C_1>0$, we have the lower bound of the Hessian with probability at least $1-d^{-10}$:
$$\nabla^2f_n(W)\succeq\Omega\Big(\frac{1-2\epsilon_0}{K^2}\sum_{l=1}^{L}\frac{\lambda_l\sigma_l^2}{\eta\kappa^2}\rho\big(\tfrac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)\Big)\cdot I \qquad (74)$$
By (69) and (70), we can also derive the upper bound as follows:
$$\|\nabla^2f_n(W)\|\le\|\nabla^2f(W)\|+\|\nabla^2f_n(W)-\nabla^2f(W)\|\le C_4\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2+C_6\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{dK\log n}{n}}\le C_2\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2 \qquad (75)$$
for some constant $C_2>0$. Combining (74) and (75), we have
$$\Omega\Big(\frac{1-2\epsilon_0}{K^2}\sum_{l=1}^{L}\frac{\lambda_l\sigma_l^2}{\eta\kappa^2}\rho\big(\tfrac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)\Big)\cdot I\preceq\nabla^2f_n(W)\preceq C_2\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\cdot I \qquad (76)$$
with probability at least $1-d^{-10}$.
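The per-sample gradient formula (43) in Property 6, on which the Hessian expressions used throughout this section build, can be validated against central finite differences. This is a standalone check of our own (sigmoid activation and $y=1$ assumed; dimensions are arbitrary):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def loss(W, x, y):
    # cross-entropy of H(W) = (1/K) * sum_j phi(w_j^T x)
    H = sigmoid(x @ W).mean()
    return -(y * np.log(H) + (1 - y) * np.log(1 - H))

def grad_analytic(W, x, y):
    K = W.shape[1]
    a = sigmoid(x @ W)                       # phi(w_j^T x), j = 1..K
    H = a.mean()
    coef = -(y - H) / (H * (1 - H)) / K      # the zeta(W) factor of (43)
    return np.outer(x, coef * a * (1 - a))   # column j = zeta * phi'(w_j^T x) * x

rng = np.random.default_rng(3)
d, K = 6, 4
W, x, y = rng.standard_normal((d, K)), rng.standard_normal(d), 1

G = grad_analytic(W, x, y)
G_fd = np.zeros_like(W)
eps = 1e-6
for i in range(d):
    for j in range(K):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        G_fd[i, j] = (loss(Wp, x, y) - loss(Wm, x, y)) / (2 * eps)
assert np.max(np.abs(G - G_fd)) < 1e-6       # matches (43) to FD accuracy
```

Here $\phi'(w_j^\top x)=\phi(w_j^\top x)(1-\phi(w_j^\top x))$ is used, which is specific to the sigmoid activation.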

C.3 PROOF OF LEMMA 4

Following the proof idea of Lemma D.4 in (Zhong et al., 2017b), we have
$$\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,I_d)+\frac12\mathcal{N}(-\mu,I_d)}\Big[\Big(\sum_{i=1}^{k}p_i^\top x\cdot\phi'(\sigma\cdot x_i)\Big)^2\Big]=A_0+B_0,$$
$$A_0=\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,I_d)+\frac12\mathcal{N}(-\mu,I_d)}\Big[\sum_{i=1}^{k}p_i^\top\phi'^2(\sigma\cdot x_i)\,xx^\top p_i\Big],\qquad B_0=\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,I_d)+\frac12\mathcal{N}(-\mu,I_d)}\Big[\sum_{i\neq l}p_i^\top\phi'(\sigma\cdot x_i)\phi'(\sigma\cdot x_l)\,xx^\top p_l\Big].$$
For $A_0$, we know that $\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,I_d)+\frac12\mathcal{N}(-\mu,I_d)}[x_j]=0$. Therefore,
$$\begin{aligned}
A_0&=\sum_{i=1}^{k}\mathbb{E}\Big[p_i^\top\phi'^2(\sigma\cdot x_i)\Big(x_i^2e_ie_i^\top+\sum_{j\neq i}x_ix_j(e_ie_j^\top+e_je_i^\top)+\sum_{j\neq i}\sum_{l\neq i}x_jx_le_je_l^\top\Big)p_i\Big]\\
&=\sum_{i=1}^{k}\mathbb{E}\Big[p_i^\top\phi'^2(\sigma\cdot x_i)\Big(x_i^2e_ie_i^\top+\sum_{j\neq i}x_j^2e_je_j^\top\Big)p_i\Big]\\
&=\sum_{i=1}^{k}\Big(\mathbb{E}[\phi'^2(\sigma\cdot x_i)x_i^2]\,p_i^\top e_ie_i^\top p_i+\sum_{j\neq i}\mathbb{E}[x_j^2]\,\mathbb{E}[\phi'^2(\sigma\cdot x_i)]\,p_i^\top e_je_j^\top p_i\Big)\\
&=\sum_{i=1}^{k}p_{ii}^2\beta_2(i,\mu,\sigma)+\sum_{i=1}^{k}\sum_{j\neq i}p_{ij}^2\beta_0(i,\mu,\sigma)(1+\mu_j^2)
\end{aligned}$$
For $B_0$, note that $\alpha_1(i,\mu,\sigma)=\mathbb{E}_{x\sim\frac12\mathcal{N}(\mu,I_d)+\frac12\mathcal{N}(-\mu,I_d)}[x_i\phi'(\sigma\cdot x_i)]=0$. By the equation on page 30 of (Zhong et al., 2017b), we have
$$\begin{aligned}
B_0&=\sum_{i\neq l}^{k}\mathbb{E}\Big[p_i^\top\phi'(\sigma\cdot x_i)\phi'(\sigma\cdot x_l)\Big(x_i^2e_ie_i^\top+x_l^2e_le_l^\top+x_ix_l(e_ie_l^\top+e_le_i^\top)+\sum_{j\neq i}x_jx_le_je_l^\top+\sum_{j\neq l}x_jx_ie_je_i^\top+\sum_{j\neq i,l}\sum_{j'\neq i,l}x_jx_{j'}e_je_{j'}^\top\Big)p_l\Big]\\
&=\sum_{i\neq l}p_{ii}p_{li}\alpha_2(i,\mu,\sigma)\alpha_0(l,\mu,\sigma)+\sum_{i\neq l}\sum_{j\neq i}p_{ij}p_{lj}\alpha_0(i,\mu,\sigma)\alpha_0(l,\mu,\sigma)(1+\mu_j^2) \qquad (81)
\end{aligned}$$
Therefore,
$$\begin{aligned}
A_0+B_0&=\sum_{i=1}^{k}\Big(p_{ii}\frac{\alpha_2(i,\mu,\sigma)}{\sqrt{1+\mu_i^2}}+\sum_{l\neq i}p_{li}\alpha_0(l,\mu,\sigma)\sqrt{1+\mu_i^2}\Big)^2-\sum_{i=1}^{k}p_{ii}^2\frac{\alpha_2^2(i,\mu,\sigma)}{1+\mu_i^2}-\sum_{i=1}^{k}\sum_{l\neq i}p_{li}^2\alpha_0(l,\mu,\sigma)^2(1+\mu_i^2)\\
&\qquad+\sum_{i=1}^{k}p_{ii}^2\beta_2(i,\mu,\sigma)+\sum_{i=1}^{k}\sum_{j\neq i}p_{ij}^2\beta_0(i,\mu,\sigma)(1+\mu_j^2)\\
&\ge\sum_{i=1}^{k}p_{ii}^2\Big(\beta_2(i,\mu,\sigma)-\frac{\alpha_2^2(i,\mu,\sigma)}{1+\mu_i^2}\Big)+\sum_{i=1}^{k}\sum_{j\neq i}p_{ij}^2\big(\beta_0(i,\mu,\sigma)-\alpha_0^2(i,\mu,\sigma)\big)(1+\mu_j^2)\\
&\ge\rho(\mu,\sigma)\|P\|_F^2 \qquad (82)
\end{aligned}$$

Under review as a conference paper at ICLR 2021

C.4 PROOF OF LEMMA 5

Following equation (92) in Lemma 8 of (Fu et al., 2020) and by (45),
$$\|\nabla^2\ell(W)-\nabla^2\ell(W')\|\le\sum_{j=1}^{K}\sum_{l=1}^{K}|\xi_{j,l}(W)-\xi_{j,l}(W')|\cdot\|xx^\top\|$$
By Lagrange's (mean value) inequality, we have
$$|\xi_{j,l}(W)-\xi_{j,l}(W')|\le\big(\max_k|T_{j,k,l}|\big)\cdot\|x\|\cdot\sqrt{K}\|W-W'\|_F$$
From Lemma 6, we know $\max_k|T_{j,k,l}|\le C_7$. By Property 5, we have
$$\mathbb{E}_{x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}[\|x\|^{2t}]\le d^t(2t-1)!!\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^{2t}$$
Therefore, for some constant $C_{12}>0$,
$$\begin{aligned}
\mathbb{E}_{x\sim\sum_{l=1}^{L}\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\sup_{W\neq W'}\frac{\|\nabla^2\ell(W)-\nabla^2\ell(W')\|}{\|W-W'\|_F}\Big]&\le K^{\frac52}\,\mathbb{E}[\|x\|^3]\le K^{\frac52}\sqrt{d\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\cdot 3d^2\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4}\\
&=C_{12}\,d^{\frac32}K^{\frac52}\sqrt{\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sum_{l=1}^{L}\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4}
\end{aligned}$$

C.5 PROOF OF LEMMA 6

Let $a=(a_1^\top,\cdots,a_K^\top)^\top\in\mathbb{R}^{dK}$. Let $\Delta_{j,l}\in\mathbb{R}^{d\times d}$ denote the $(j,l)$-th block of $\nabla^2f(W)-\nabla^2f(W^*)\in\mathbb{R}^{dK\times dK}$.
By definition, ||∇ 2 f (W ) -∇ 2 f (W * )|| = max ||a||=1 K j=1 K l=1 a j ∆ j,l a l By the mean value theorem and (45), ∆ j,l = ∂ 2 f (W ) ∂w j ∂w l - ∂ 2 f (W * ) ∂w * j ∂w * l = E x∼ L l=1 λ l N (µ l ,σ 2 l I d ) [(ξ j,l (W ) -ξ j,l (W * )) • xx ] = E x∼ L l=1 λ l N (µ l ,σ 2 l I d ) [ K k=1 ∂ξ j,l (W ) ∂w k , w k -w * k • xx ] = E x∼ L l=1 λ l N (µ l ,σ 2 l I d ) [ K k=1 T j,l,k • x, w k -w * k • xx ] where W = γW + (1 -γ)W * for some γ ∈ (0, 1) and T j,l,k is defined such that ∂ξ j,l (W ) ∂w k = T j,l,k • x ∈ R d . Then we provide an upper bound for ξ j,l . Since that y = 1 or 0, we first compute the case in which y = 1. From (45) we can obtain ξ j,l (W ) = 1 K 2 φ (w j x)φ (w l x) • 1 H 2 (W ) , j = l 1 K 2 φ (w j x)φ (w l x) • 1 H 2 (W ) -1 K φ (w j x) • 1 H(W ) , j = l We can bound ξ j,l (W ) by bounding each components of (90). Note that we have 1 K 2 φ (w j x)φ (w l x) • 1 H 2 (W ) ≤ 1 K 2 φ(w j x)φ(w l x)(1 -φ(w j x))(1 -φ(w l x)) 1 K 2 φ(w j x)φ(w l x) ≤ 1 (91) 1 K φ (w j x) • 1 H(W ) ≤ 1 K φ(w j x)(1 -φ(w j x))(1 -2φ(w j x)) 1 K φ(w j x) ≤ 1 where ( 91) holds for any j, l ∈ [K]. The case y = 0 can be computed with the same upper bound by substituting 90), ( 91) and ( 92). Therefore, there exists a constant C 9 > 0, such that (1 -H(W )) = 1 K K j=1 (1 -φ(w j x)) for H(W ) in ( |ξ j,l (W )| ≤ C 9 We then need to calculate T j,l,k . Following the analysis of ξ j,l (W ), we only consider the case of y = 1 here for simplicity. 
$$T_{j,l,k} = \frac{-2}{K^3H^3(W)}\phi'(w_j^\top x)\phi'(w_l^\top x)\phi'(w_k^\top x),\quad j,l,k\ \text{mutually distinct}\quad(94)$$
$$T_{j,j,k} = \begin{cases}\dfrac{-2}{K^3H^3(W)}\phi'(w_j^\top x)\phi'(w_j^\top x)\phi'(w_k^\top x)+\dfrac{1}{K^2H^2(W)}\phi''(w_j^\top x)\phi'(w_k^\top x), & j\neq k\\[2mm] \dfrac{-2}{K^3H^3(W)}(\phi'(w_j^\top x))^3+\dfrac{3}{K^2H^2(W)}\phi'(w_j^\top x)\phi''(w_j^\top x)-\dfrac{\phi'''(w_j^\top x)}{KH(W)}, & j=k\end{cases}\quad(95)$$
$$a_j^\top\Delta_{j,l}a_l = \mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I)}\Big[\Big(\sum_{k=1}^KT_{j,l,k}\langle x,w_k-w_k^*\rangle\Big)(a_j^\top x)(a_l^\top x)\Big] \le \sqrt{\mathbb{E}\Big[\sum_{k=1}^KT_{j,k,l}^2\Big]}\cdot\sqrt{\mathbb{E}\Big[\sum_{k=1}^K\big(\langle x,w_k-w_k^*\rangle(a_j^\top x)(a_l^\top x)\big)^2\Big]}$$
$$\le \sqrt{\mathbb{E}\Big[\sum_{k=1}^KT_{j,k,l}^2\Big]}\cdot\Big(\sum_{k=1}^K\sqrt{\mathbb{E}\big[((w_k-w_k^*)^\top x)^4\big]\cdot\mathbb{E}\big[(a_j^\top x)^4(a_l^\top x)^4\big]}\Big)^{\frac12} \le C_8\sqrt{\mathbb{E}\Big[\sum_{k=1}^KT_{j,k,l}^2\Big]}\cdot\sqrt{\sum_{k=1}^K\|w_k-w_k^*\|_2^2}\cdot\|a_j\|_2\|a_l\|_2\cdot\Big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^8\Big)^{\frac14}\quad(96)$$
for some constant $C_8>0$. All three inequalities of (96) are derived from the Cauchy–Schwarz inequality. Note that
$$\Big|\frac{-2}{K^3H^3(W)}(\phi'(w_j^\top x))^2\phi'(w_k^\top x)\Big| \le \frac{2\phi^2(w_j^\top x)(1-\phi(w_j^\top x))^2\phi(w_k^\top x)(1-\phi(w_k^\top x))}{K^3\cdot\frac{1}{K^3}\phi^2(w_j^\top x)\phi(w_k^\top x)} = 2(1-\phi(w_j^\top x))^2(1-\phi(w_k^\top x)) \le 2\quad(97)$$
$$\Big|\frac{-2}{K^3H^3(W)}\phi'(w_j^\top x)\phi'(w_l^\top x)\phi'(w_k^\top x)\Big| \le 2\quad(98)$$
$$\frac{3}{K^2H^2(W)}\phi'(w_j^\top x)\phi''(w_j^\top x) \le \frac{3\phi(w_j^\top x)(1-\phi(w_j^\top x))(1-2\phi(w_j^\top x))\phi(w_k^\top x)(1-\phi(w_k^\top x))}{K^2\cdot\frac{1}{K^2}\phi(w_j^\top x)\phi(w_k^\top x)} = 3(1-\phi(w_j^\top x))(1-2\phi(w_j^\top x))(1-\phi(w_k^\top x)) \le 3\quad(99)$$
$$\frac{\phi'''(w_j^\top x)}{KH(W)} \le \frac{\phi(w_j^\top x)(1-\phi(w_j^\top x))(1-6\phi(w_j^\top x)+6\phi^2(w_j^\top x))}{K\cdot\frac1K\phi(w_j^\top x)} \le 1\quad(100)$$
Therefore, by combining (94), (95) and (97) to (100), we have $|T_{j,l,k}|\le C_7$, and thus $T_{j,l,k}^2\le C_7^2$ for all $j,l,k\in[K]$, for some constant $C_7>0$. (101)
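The bounds in (91), (92) and (97) to (100) all rest on the closed-form derivatives of the sigmoid: $\phi'=\phi(1-\phi)$, $\phi''=\phi'(1-2\phi)$ and $\phi'''=\phi'(1-6\phi+6\phi^2)$. A quick finite-difference sanity check of these identities (purely illustrative, not part of the proof):

```python
import numpy as np

def phi(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Closed-form derivatives used in the bounds above.
def dphi(z):  return phi(z) * (1 - phi(z))
def d2phi(z): return dphi(z) * (1 - 2 * phi(z))
def d3phi(z): return dphi(z) * (1 - 6 * phi(z) + 6 * phi(z) ** 2)

# Central finite differences confirm each identity on a grid.
z, h = np.linspace(-5, 5, 201), 1e-5
assert np.allclose((phi(z + h) - phi(z - h)) / (2 * h), dphi(z), atol=1e-6)
assert np.allclose((dphi(z + h) - dphi(z - h)) / (2 * h), d2phi(z), atol=1e-6)
assert np.allclose((d2phi(z + h) - d2phi(z - h)) / (2 * h), d3phi(z), atol=1e-6)
```

Since $\phi\in(0,1)$, each factor such as $(1-\phi)$ or $(1-2\phi)$ is bounded by 1 in absolute value, which is exactly how the constants 1, 2 and 3 arise in (97) to (100).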
By (88), (89), (96), (101) and the Cauchy–Schwarz inequality, we have
$$\|\nabla^2f(W)-\nabla^2f(W^*)\| \le C_8C_7^2K\,\|W-W^*\|_F\Big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^8\Big)^{\frac14}\cdot\max_{\|a\|=1}\sum_{j=1}^K\sum_{l=1}^K\|a_j\|_2\|a_l\|_2$$
$$\le C_8C_7^2K\,\|W-W^*\|_F\Big(\cdots\Big)^{\frac14}\cdot\Big(\sum_{j=1}^K\|a_j\|_2\Big)^2 \le C_8C_7^2K^2\,\|W-W^*\|_F\Big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^8\Big)^{\frac14}\quad(102)$$
Hence, we have
$$\|\nabla^2f(W)-\nabla^2f(W^*)\| \le C_5K^{\frac32}\Big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^8\Big)^{\frac14}\|W-W^*\|_F\quad(103)$$
for some constant $C_5>0$.

C.6 PROOF OF LEMMA 7

From (Fu et al., 2020), we know
$$\nabla^2f(W^*) \succeq \min_{\|a\|=1}\frac{4}{K^2}\,\mathbb{E}_{x\sim\sum_{l=1}^L\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{j=1}^K\phi'(w_j^{*\top}x)(a_j^\top x)\Big)^2\Big]\cdot I_{dK}\quad(104)$$
with $a=(a_1^\top,\cdots,a_K^\top)^\top\in\mathbb{R}^{dK}$. And
$$\nabla^2f(W^*) \preceq \max_{\|a\|=1}a^\top\nabla^2f(W^*)a\cdot I_{dK} \preceq C_4\cdot\max_{\|a\|=1}\mathbb{E}_x\Big[\sum_{j=1}^K(a_j^\top x)^2\Big]\cdot I_{dK}\quad(105)$$
for some constant $C_4>0$. By applying Property 4, we can derive the upper bound in (105) as
$$C_4\,\mathbb{E}_x\Big[\sum_{j=1}^K(a_j^\top x)^2\Big]\cdot I_{dK} \preceq C_4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\cdot I_{dK}\quad(106)$$
To find a lower bound for (104), we first transfer the expectation over the Gaussian mixture model to the weighted sum of expectations over general Gaussian distributions:
$$\min_{\|a\|=1}\mathbb{E}_{x\sim\sum_{l=1}^L\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{j=1}^K\phi'(w_j^{*\top}x)(a_j^\top x)\Big)^2\Big] = \min_{\|a\|=1}\sum_{l=1}^L\lambda_l\,\mathbb{E}_{x\sim\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{j=1}^K\phi'(w_j^{*\top}x)(a_j^\top x)\Big)^2\Big]\quad(107)$$
Denote by $U\in\mathbb{R}^{d\times K}$ the orthogonal basis of $W^*$. For any vector $a_i\in\mathbb{R}^d$, there exist two vectors $b_i\in\mathbb{R}^K$ and $c_i\in\mathbb{R}^{d-K}$ such that
$$a_i = Ub_i + U_\perp c_i\quad(108)$$
where $U_\perp\in\mathbb{R}^{d\times(d-K)}$ denotes the orthogonal complement of $U$. We also have $U_\perp^\top\mu_l=0$ by (9).
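The decomposition $a_i=Ub_i+U_\perp c_i$ and the orthogonality $U_\perp^\top\mu_l=0$ (which holds when $\mu_l$ lies in the span of $W^*$, as assumption (9) requires) admit a quick numeric sanity check; the dimensions and random $W^*$ below are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 3
W_star = rng.standard_normal((d, K))  # stand-in for the teacher weights

# U: orthonormal basis of span(W*); U_perp: basis of its orthogonal complement.
Q, _ = np.linalg.qr(W_star, mode='complete')
U, U_perp = Q[:, :K], Q[:, K:]

# Any a in R^d splits as a = U b + U_perp c with b = U^T a, c = U_perp^T a.
a = rng.standard_normal(d)
b, c = U.T @ a, U_perp.T @ a
assert np.allclose(U @ b + U_perp @ c, a)

# If mu lies in span(W*), the complement sees none of it: U_perp^T mu = 0.
mu = W_star @ rng.standard_normal(K)
assert np.allclose(U_perp.T @ mu, 0)
```

This is exactly why the cross term $C$ in (111) vanishes: under an isotropic Gaussian, $U_\perp^\top x$ is independent of $U^\top x$ and has mean $U_\perp^\top\mu_l=0$.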
Plugging (108) into the RHS of (107), we have
$$\mathbb{E}_{x\sim\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{i=1}^Ka_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] = \mathbb{E}_{x\sim\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{i=1}^K(Ub_i+U_\perp c_i)^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] = A+B+C\quad(109)$$
where
$$A = \mathbb{E}_{x\sim\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{i=1}^Kb_i^\top U^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big]\quad(110)$$
$$C = \mathbb{E}\Big[2\sum_{i=1}^Kc_i^\top U_\perp^\top x\cdot\phi'(w_i^{*\top}x)\cdot\sum_{i=1}^Kb_i^\top U^\top x\cdot\phi'(w_i^{*\top}x)\Big] = \sum_{i=1}^K\sum_{j=1}^K\mathbb{E}\big[2c_i^\top U_\perp^\top x\big]\,\mathbb{E}\big[b_j^\top U^\top x\cdot\phi'(w_i^{*\top}x)\phi'(w_j^{*\top}x)\big] = \sum_{i=1}^K\sum_{j=1}^K2c_i^\top U_\perp^\top\mu_l\,\mathbb{E}\big[b_j^\top U^\top x\cdot\phi'(w_i^{*\top}x)\phi'(w_j^{*\top}x)\big] = 0\quad(111)$$
where the last step is by $U_\perp^\top\mu_l=0$ from (9). For $B$, defining $t=\sum_{i=1}^K\phi'(w_i^{*\top}x)c_i\in\mathbb{R}^{d-K}$ and $s=U_\perp^\top x$,
$$B = \mathbb{E}_{x\sim\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{i=1}^Kc_i^\top U_\perp^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] = \mathbb{E}[(t^\top s)^2] = \sum_i\mathbb{E}[t_i^2s_i^2]+\sum_{i\neq j}\mathbb{E}[t_it_js_is_j]$$
$$= \sum_i\mathbb{E}[t_i^2]\sigma_l^2+\sum_i\mathbb{E}[t_i^2](U_\perp^\top\mu_l)_i^2+\sum_{i\neq j}\mathbb{E}[t_it_j](U_\perp^\top\mu_l)_i(U_\perp^\top\mu_l)_j = \mathbb{E}\Big[\sum_{i=1}^{d-K}t_i^2\,\sigma_l^2\Big]+\mathbb{E}\big[(t^\top U_\perp^\top\mu_l)^2\big] = \mathbb{E}\Big[\sum_{i=1}^{d-K}t_i^2\,\sigma_l^2\Big]\quad(112)$$
The last step is by $U_\perp^\top\mu_l=0$. The fourth step is because $s_i$ is independent of $t_i$, so $\mathbb{E}[t_it_js_is_j]=\mathbb{E}[t_it_j]\mathbb{E}[s_is_j]$, and
$$\mathbb{E}[s_is_j]=\begin{cases}(U_\perp^\top\mu_l)_i(U_\perp^\top\mu_l)_j, & i\neq j\\ (U_\perp^\top\mu_l)_i^2+\sigma_l^2, & i=j\end{cases}\quad(113)$$
Since $\big(\sum_{i=1}^kp_i^\top x\cdot\phi'(\sigma\circ x_i)\big)^2$ is an even function, from Property 3 we have
$$\mathbb{E}_{x\sim\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{i=1}^kp_i^\top x\cdot\phi'(\sigma\circ x_i)\Big)^2\Big] = \mathbb{E}_{x\sim\frac12\mathcal{N}(\mu_l,\sigma_l^2I_d)+\frac12\mathcal{N}(-\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{i=1}^kp_i^\top x\cdot\phi'(\sigma\circ x_i)\Big)^2\Big]\quad(114)$$
Combining Lemma 4 and Property 3, we next follow the derivation for the standard Gaussian distribution on Page 36 of (Zhong et al., 2017b) and generalize the result to a Gaussian distribution with arbitrary mean and variance as follows.
$$A = \mathbb{E}_{x\sim\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{i=1}^Kb_i^\top U^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] = (2\pi\sigma_l^2)^{-\frac K2}\int\Big(\sum_{i=1}^Kb_i^\top z\cdot\phi'(v_i^\top z)\Big)^2\exp\Big(-\frac{\|z-U^\top\mu_l\|^2}{2\sigma_l^2}\Big)dz$$
$$= (2\pi\sigma_l^2)^{-\frac K2}\int\Big(\sum_{i=1}^Kb_i^\top V^\dagger s\cdot\phi'(s_i)\Big)^2\exp\Big(-\frac{\|V^\dagger s-U^\top\mu_l\|^2}{2\sigma_l^2}\Big)\big|\det(V^\dagger)\big|\,ds$$
$$\ge (2\pi\sigma_l^2)^{-\frac K2}\int\Big(\sum_{i=1}^kb_i^\top V^\dagger s\cdot\phi'(s_i)\Big)^2\exp\Big(-\frac{\|s-VU^\top\mu_l\|^2}{2\delta_K^2(W^*)\sigma_l^2}\Big)\big|\det(V^\dagger)\big|\,ds$$
$$\ge (2\pi)^{-\frac K2}\sigma_l^{-K}\int\Big(\sum_{i=1}^kb_i^\top V^\dagger(\delta_K(W^*)\sigma_l)g\cdot\phi'(\delta_K(W^*)\sigma_l\cdot g_i)\Big)^2\exp\Big(-\frac{\big\|g-\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)}\big\|^2}{2}\Big)\big|\det(V^\dagger)\big|\,\sigma_l^K\delta_K^K(W^*)\,dg$$
$$= \frac{\sigma_l^2}{\eta}\,\mathbb{E}_g\Big[\Big(\sum_{i=1}^K(b_i^\top V^\dagger\delta_K(W^*))g\cdot\phi'(\sigma_l\delta_K(W^*)\cdot g_i)\Big)^2\Big] \ge \frac{\sigma_l^2}{\eta\kappa^2}\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big)\|b\|^2\quad(115)$$
The second step is by letting $z=U^\top x$. The third step is by letting $s=Vz$. The second-to-last step follows from $g=\frac{s}{\sigma_l\delta_K(W^*)}$, where $g\sim\mathcal{N}\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},I_K\big)$, and the last inequality is by Lemma 4. Similarly, we extend the derivation on Page 37 of (Zhong et al., 2017b) from the standard Gaussian distribution to a general Gaussian distribution as follows.
$$B = \sigma_l^2\,\mathbb{E}_{x\sim\mathcal{N}(\mu_l,\sigma_l^2I_d)}\big[\|t\|^2\big] \ge \frac{\sigma_l^2}{\eta\kappa^2}\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big)\|c\|^2\quad(116)$$
Combining (109)–(112), (115) and (116), we have
$$\min_{\|a\|=1}\mathbb{E}_{x\sim\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{i=1}^ka_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] \ge \frac{\sigma_l^2}{\eta\kappa^2}\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big)\quad(117)$$
For the Gaussian mixture model $x\sim\sum_{l=1}^L\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)$, we then have
$$\min_{\|a\|=1}\mathbb{E}_{x\sim\sum_{l=1}^L\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{i=1}^ka_i^\top x\cdot\phi'(w_i^{*\top}x)\Big)^2\Big] \ge \sum_{l=1}^L\lambda_l\frac{\sigma_l^2}{\eta\kappa^2}\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big)\quad(118)$$
Therefore,
$$\frac{4}{K^2}\sum_{l=1}^L\lambda_l\frac{\sigma_l^2}{\eta\kappa^2}\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big)\cdot I_{dK} \preceq \nabla^2f(W^*) \preceq C_4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\cdot I_{dK}\quad(119)$$
From (68) in Lemma 6, since we have the condition $\|W-W^*\|_F\le r$ and (53), we can obtain
$$\|\nabla^2f(W)-\nabla^2f(W^*)\| \le C_5K^{\frac32}\Big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^8\Big)^{\frac14}\|W-W^*\|_F \le \frac{4\epsilon_0}{K^2}\sum_{l=1}^L\lambda_l\frac{\sigma_l^2}{\eta\kappa^2}\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big)\quad(120)$$
where $\epsilon_0\in(0,\frac14)$.
Then we have
$$\lambda_{\min}\big(\nabla^2f(W)\big) \ge \lambda_{\min}\big(\nabla^2f(W^*)\big)-\|\nabla^2f(W)-\nabla^2f(W^*)\| \ge \frac{4(1-\epsilon_0)}{K^2}\sum_{l=1}^L\lambda_l\frac{\sigma_l^2}{\eta\kappa^2}\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big)\quad(121)$$
$$\|\nabla^2f(W)\| \le \|\nabla^2f(W^*)\|+\|\nabla^2f(W)-\nabla^2f(W^*)\| \le C_4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2+\frac{4\epsilon_0}{K^2}\sum_{l=1}^L\lambda_l\frac{\sigma_l^2}{\eta\kappa^2}\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big) \lesssim C_4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\quad(122)$$
The last inequality of (122) holds since $C_4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2=\Omega(\sigma_{\max}^2)$, $\frac{4}{K^2}\sum_{l=1}^L\lambda_l\frac{\sigma_l^2}{\eta\kappa^2}\rho\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)=O\big(\frac{\sigma_{\max}^2}{K^2}\big)$, and $\Omega(\sigma_{\max}^2)\ge O\big(\frac{\sigma_{\max}^2}{K^2}\big)$. Combining (121) and (122), we have
$$\frac{4(1-\epsilon_0)}{K^2}\sum_{l=1}^L\lambda_l\frac{\sigma_l^2}{\eta\kappa^2}\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big)\cdot I \preceq \nabla^2f(W) \preceq C_4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\cdot I\quad(123)$$
By the construction of the cover, $\|W-W_{j(W)}\|_F\le\epsilon$ for all $W\in B(W^*,r)$. Then for any $W\in B(W^*,r)$, we have
$$\|\nabla^2f_n(W)-\nabla^2f(W)\| \le \frac1n\Big\|\sum_{i=1}^n\big[\nabla^2\ell(W;x_i)-\nabla^2\ell(W_{j(W)};x_i)\big]\Big\|+\Big\|\frac1n\sum_{i=1}^n\nabla^2\ell(W_{j(W)};x_i)-\mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\big[\nabla^2\ell(W_{j(W)};x)\big]\Big\|+\Big\|\mathbb{E}_x\big[\nabla^2\ell(W_{j(W)};x)\big]-\mathbb{E}_x\big[\nabla^2\ell(W;x)\big]\Big\|\quad(124)$$
Hence, we have
$$\mathbb{P}\Big(\sup_{W\in B(W^*,r)}\|\nabla^2f_n(W)-\nabla^2f(W)\|\ge t\Big) \le \mathbb{P}(A_t)+\mathbb{P}(B_t)+\mathbb{P}(C_t)\quad(125)$$
where $A_t$, $B_t$ and $C_t$ are defined as
$$A_t=\Big\{\sup_{W\in B(W^*,r)}\frac1n\Big\|\sum_{i=1}^n\big[\nabla^2\ell(W;x_i)-\nabla^2\ell(W_{j(W)};x_i)\big]\Big\|\ge\frac t3\Big\}\quad(126)$$
$$B_t=\Big\{\sup_{W\in B(W^*,r)}\Big\|\frac1n\sum_{i=1}^n\nabla^2\ell(W_{j(W)};x_i)-\mathbb{E}_x\big[\nabla^2\ell(W_{j(W)};x)\big]\Big\|\ge\frac t3\Big\}\quad(127)$$
$$C_t=\Big\{\sup_{W\in B(W^*,r)}\Big\|\mathbb{E}_x\big[\nabla^2\ell(W_{j(W)};x)\big]-\mathbb{E}_x\big[\nabla^2\ell(W;x)\big]\Big\|\ge\frac t3\Big\}\quad(128)$$
Then we bound $\mathbb{P}(A_t)$, $\mathbb{P}(B_t)$ and $\mathbb{P}(C_t)$ separately.

1) Upper bound on $\mathbb{P}(B_t)$.
By Lemma 6 in (Fu et al., 2020), we obtain
$$\Big\|\frac1n\sum_{i=1}^n\nabla^2\ell(W;x_i)-\mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\big[\nabla^2\ell(W;x)\big]\Big\| \le 2\sup_{v\in V_{\frac14}}\Big\langle v,\Big(\frac1n\sum_{i=1}^n\nabla^2\ell(W;x_i)-\mathbb{E}_x\big[\nabla^2\ell(W;x)\big]\Big)v\Big\rangle\quad(129)$$
where $V_{\frac14}$ is a $\frac14$-cover of the unit-Euclidean-norm ball $B(0,1)$ with $\log|V_{\frac14}|\le dK\log12$. Taking the union bound over $W_\epsilon$ and $V_{\frac14}$, we have
$$\mathbb{P}(B_t) \le \mathbb{P}\Big(\sup_{W\in W_\epsilon,\,v\in V_{\frac14}}\frac1n\sum_{i=1}^nG_i\ge\frac t6\Big) \le \exp\Big(dK\Big(\log\frac{3r}{\epsilon}+\log12\Big)\Big)\sup_{W\in W_\epsilon,\,v\in V_{\frac14}}\mathbb{P}\Big(\Big|\frac1n\sum_{i=1}^nG_i\Big|\ge\frac t6\Big)\quad(130)$$
where $G_i=\big\langle v,\big(\nabla^2\ell(W;x_i)-\mathbb{E}_x[\nabla^2\ell(W;x)]\big)v\big\rangle$ and $\mathbb{E}[G_i]=0$. Here $v=(u_1^\top,\cdots,u_K^\top)^\top\in\mathbb{R}^{dK}$. Then
$$|G_i| = \Big|\sum_{j=1}^K\sum_{l=1}^K\Big(\xi_{j,l}\,u_j^\top xx^\top u_l-\mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I)}\big[\xi_{j,l}\,u_j^\top xx^\top u_l\big]\Big)\Big| \le C_9\Big(\sum_{j=1}^K(u_j^\top x)^2+\sum_{j=1}^K\mathbb{E}_x\big[(u_j^\top x)^2\big]\Big)\quad(131)$$
for some $C_9>0$. The first step of (131) is by (44). The last step is by (93) and the Cauchy–Schwarz inequality.
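The exponent in the union bound (130) is the sum of two covering-number contributions: $dK\log(3r/\epsilon)$ for the $\epsilon$-net of $B(W^*,r)$ and at most $dK\log 12$ for the $\frac14$-net of the unit ball. A minimal helper sketching this bookkeeping (the numeric arguments are illustrative assumptions):

```python
import math

def log_union_bound_terms(d, K, r, eps):
    """Upper bound on log|W_eps| + log|V_{1/4}|: an eps-net of the Frobenius
    ball B(W*, r) in R^{d x K} contributes at most d*K*log(3r/eps), and a
    1/4-net of the unit ball in R^{dK} at most d*K*log(12) (Vershynin, 2010)."""
    return d * K * math.log(3 * r / eps) + d * K * math.log(12)

# The exponent grows linearly in the parameter count d*K, which is why the
# deviation t in (138) must scale like sqrt(dK log(.) / n).
small = log_union_bound_terms(10, 5, 1.0, 0.01)
large = log_union_bound_terms(20, 5, 1.0, 0.01)
assert math.isclose(large, 2 * small)
```

Because the exponent is linear in $dK$, a per-point tail bound of the form $\exp(-cnt^2)$ survives the union bound exactly when $nt^2\gtrsim dK\log(3r/\epsilon)$, matching (138).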

$$\mathbb{E}[|G_i|^p] \le \sum_{l=1}^p\binom pl C_9\,\mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\Big(\sum_{j=1}^K(u_j^\top x)^2\Big)^l\Big]\Big(\sum_{j=1}^K\mathbb{E}_x\big[(u_j^\top x)^2\big]\Big)^{p-l}$$
$$= \sum_{l=1}^p\binom pl C_9\,\mathbb{E}_x\Big[\sum_{l_1+\cdots+l_K=l}\frac{l!}{\prod_{j=1}^Kl_j!}\prod_{j=1}^K(u_j^\top x)^{2l_j}\Big]\Big(\sum_{j=1}^K\mathbb{E}_x\big[(u_j^\top x)^2\big]\Big)^{p-l} = \sum_{l=1}^p\binom pl C_9\sum_{l_1+\cdots+l_K=l}\frac{l!}{\prod_{j=1}^Kl_j!}\prod_{j=1}^K\mathbb{E}_x\big[(u_j^\top x)^{2l_j}\big]\cdot\Big(\sum_{j=1}^K\mathbb{E}_x\big[(u_j^\top x)^2\big]\Big)^{p-l}$$
$$= C_9\sum_{l=1}^p\binom pl\Big(\sum_{j=1}^K\mathbb{E}_x\big[(u_j^\top x)^2\big]\Big)^l\Big(\sum_{j=1}^K\mathbb{E}_x\big[(u_j^\top x)^2\big]\Big)^{p-l} = C_9\Big(\sum_{j=1}^K\mathbb{E}_x\big[(u_j^\top x)^2\big]\Big)^p \le C_9\Big(\sum_{j=1}^K\|u_j\|^2\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\Big)^p \le C_9\Big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\Big)^p\quad(132)$$
where the second-to-last inequality results from Property 4, and the last inequality is because for $v\in V_{\frac14}$, $\sum_{j=1}^K\|u_j\|^2=\|v\|^2\le1$. Then
$$\mathbb{E}[\exp(\theta G_i)] = 1+\theta\,\mathbb{E}[G_i]+\sum_{p=2}^\infty\frac{\theta^p\,\mathbb{E}[|G_i|^p]}{p!} \le 1+\sum_{p=2}^\infty\frac{|e\theta|^p}{p^p}\,C_9\Big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\Big)^p \le 1+C_9\,|e\theta|^2\Big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\Big)^2\quad(133)$$
where the first inequality holds from $p!\ge(\frac pe)^p$ and (132), and the last inequality holds provided that
$$\max_{p\ge2}\Bigg\{\frac{\frac{|e\theta|^{p+1}}{(p+1)^{p+1}}\big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\big)^{p+1}}{\frac{|e\theta|^p}{p^p}\big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\big)^p}\Bigg\} \le \frac12\quad(134)$$
Note that the quantity inside the maximization in (134) achieves its maximum at $p=2$, because it is monotonically decreasing in $p$. Therefore, (134) holds if $\theta\le\frac{27}{4e\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}$. Then
$$\mathbb{P}\Big(\frac1n\sum_{i=1}^nG_i\ge\frac t6\Big) = \mathbb{P}\Big(\exp\Big(\theta\sum_{i=1}^nG_i\Big)\ge\exp\Big(\frac{n\theta t}6\Big)\Big) \le e^{-\frac{n\theta t}6}\prod_{i=1}^n\mathbb{E}[\exp(\theta G_i)] \le \exp\Big(C_{10}\theta^2n\Big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\Big)^2-\frac{n\theta t}6\Big)\quad(135)$$
for some constant $C_{10}>0$. The first inequality follows from Markov's inequality.
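The Chernoff step above says that an average of centered variables with a mixture-controlled moment-generating function concentrates at rate roughly $\sqrt{\log n/n}$. A toy Monte Carlo illustration on a two-component Gaussian mixture (the mixture parameters and the constant 6 in the bound are illustrative assumptions, not the paper's constants):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n points from 0.5*N(2, 1) + 0.5*N(-2, 0.25): a toy mixture
    whose mean is exactly 0."""
    comp = rng.integers(0, 2, size=n)
    mu = np.where(comp == 0, 2.0, -2.0)
    sig = np.where(comp == 0, 1.0, 0.5)
    return mu + sig * rng.standard_normal(n)

# The empirical mean deviates from E[x] = 0 by no more than a generous
# Bernstein-style envelope C * sqrt(log n / n).
for n in [100, 10_000]:
    dev = abs(sample(n).mean())
    bound = 6 * np.sqrt(np.log(n) / n)
    assert dev < bound
```

The same mechanism, applied to the matrix-valued statistic $G_i$ with the union bound of (130), yields the high-probability bound (144).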
When $\theta=\min\Big\{\frac{t}{12C_{10}\big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\big)^2},\ \frac{27}{4e\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}\Big\}$, we have a modified Bernstein's inequality for the Gaussian mixture model as follows:
$$\mathbb{P}\Big(\frac1n\sum_{i=1}^nG_i\ge\frac t6\Big) \le \exp\Big(\max\Big\{-\frac{C_{10}nt^2}{144\big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\big)^2},\ -\frac{C_{11}nt}{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}\Big\}\Big)\quad(136)$$
for some constant $C_{11}>0$. We can obtain the same bound for $\mathbb{P}\big(-\frac1n\sum_{i=1}^nG_i\ge\frac t6\big)$ by replacing $G_i$ with $-G_i$. Therefore, we have
$$\mathbb{P}\Big(\Big|\frac1n\sum_{i=1}^nG_i\Big|\ge\frac t6\Big) \le 2\exp\Big(\max\Big\{-\frac{C_{10}nt^2}{144\big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\big)^2},\ -\frac{C_{11}nt}{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}\Big\}\Big)\quad(137)$$
Thus, as long as
$$t \ge C_6\max\Big\{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{dK\log\frac{36r}{\epsilon}+\log\frac4\delta}{n}},\ \sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\cdot\frac{dK\log\frac{36r}{\epsilon}+\log\frac4\delta}{n}\Big\}\quad(138)$$
for some large constant $C_6>0$, we have $\mathbb{P}(B_t)\le\frac\delta2$.

2) Upper bound on $\mathbb{P}(A_t)$ and $\mathbb{P}(C_t)$.

From Lemma 5, we can obtain
$$\sup_{W\in B(W^*,r)}\Big\|\mathbb{E}_x\big[\nabla^2\ell(W_{j(W)};x)\big]-\mathbb{E}_x\big[\nabla^2\ell(W;x)\big]\Big\| \le \sup_{W\in B(W^*,r)}\frac{\big\|\mathbb{E}_x\big[\nabla^2\ell(W_{j(W)};x)\big]-\mathbb{E}_x\big[\nabla^2\ell(W;x)\big]\big\|}{\|W-W_{j(W)}\|_F}\cdot\sup_{W\in B(W^*,r)}\|W-W_{j(W)}\|_F \le C_{12}\,d^{\frac32}K^{\frac52}\sqrt{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4}\cdot\epsilon\quad(139)$$
Therefore, $\mathbb{P}(C_t)=0$ as long as
$$t \ge C_{12}\,d^{\frac32}K^{\frac52}\sqrt{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4}\cdot\epsilon\quad(140)$$
We can bound $\mathbb{P}(A_t)$ as below.

$$\mathbb{P}\Big(\sup_{W\in B(W^*,r)}\frac1n\Big\|\sum_{i=1}^n\big[\nabla^2\ell(W_{j(W)};x_i)-\nabla^2\ell(W;x_i)\big]\Big\|\ge\frac t3\Big) \le \frac3t\,\mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\sup_{W\in B(W^*,r)}\frac1n\Big\|\sum_{i=1}^n\big[\nabla^2\ell(W_{j(W)};x_i)-\nabla^2\ell(W;x_i)\big]\Big\|\Big]$$
$$= \frac3t\,\mathbb{E}_x\Big[\sup_{W\in B(W^*,r)}\big\|\nabla^2\ell(W_{j(W)};x)-\nabla^2\ell(W;x)\big\|\Big] \le \frac3t\,\mathbb{E}\Big[\sup_{W\in B(W^*,r)}\frac{\|\nabla^2\ell(W_{j(W)};x)-\nabla^2\ell(W;x)\|}{\|W-W_{j(W)}\|_F}\Big]\cdot\sup_{W\in B(W^*,r)}\|W-W_{j(W)}\|_F \le \frac{3C_{12}\,d^{\frac32}K^{\frac52}\sqrt{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4}\cdot\epsilon}{t}\quad(141)$$
Thus, taking
$$t \ge \frac{C_{12}\,d^{\frac32}K^{\frac52}\sqrt{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4}\cdot\epsilon}{\delta}\quad(142)$$
ensures that $\mathbb{P}(A_t)\le\frac\delta2$.

3) Final step.

Let $\epsilon=\frac{\delta}{C_{12}\,d^{\frac32}K^{\frac52}\sqrt{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4}\cdot ndK}$ and $\delta=d^{-10}$. Then from (138) and (142), we need
$$t > \max\Big\{\frac1{ndK},\ C_6\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{dK\log\big(36rnd^{\frac{25}2}K^{\frac72}\sqrt{\sum_l\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sum_l\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4}\big)+\log\frac4\delta}{n}},\ \sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\cdot\frac{dK\log\big(36rnd^{\frac{25}2}K^{\frac72}\sqrt{\sum_l\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sum_l\lambda_l(\|\mu_l\|_\infty+\sigma_l)^4}\big)+\log\frac4\delta}{n}\Big\}\quad(143)$$
So by setting $t=\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{dK\log n}{n}}$, as long as $n\ge C\,dK\log dK$, we have
$$\mathbb{P}\Big(\sup_{W\in B(W^*,r)}\|\nabla^2f_n(W)-\nabla^2f(W)\|\ge C_6\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{dK\log n}{n}}\Big) \le d^{-10}\quad(144)$$

D PROOF OF LEMMA 2

We first present a lemma used in proving Lemma 2 in Section D.1 and then prove Lemma 2 in Section D.2.

D.1 A USEFUL LEMMA USED IN THE PROOF

Lemma 9. If $r$ is defined in (53) for $\epsilon_0\in(0,\frac14)$, then with probability at least $1-d^{-10}$, we have
$$\sup_{W\in B(W^*,r)}\|\nabla\tilde f_n(W)-\nabla\tilde f(W)\| \le C_{13}\,K\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{d\log n}{n}}\,(1+\xi)\quad(145)$$
for some constant $C_{13}>0$.

Proof: Note that $\nabla\tilde f_n(W)=\nabla f_n(W)+\frac1n\sum_{i=1}^n\nu_i$ and $\nabla\tilde f(W)=\nabla f(W)+\mathbb{E}[\nu_i]=\nabla f(W)$. Therefore, we have
$$\sup_{W\in B(W^*,r)}\|\nabla\tilde f_n(W)-\nabla\tilde f(W)\| \le \sup_{W\in B(W^*,r)}\|\nabla f_n(W)-\nabla f(W)\|+\Big\|\frac1n\sum_{i=1}^n\nu_i\Big\|\quad(146)$$
Then, similar to the idea of the proof of Lemma 8, we adopt an $\epsilon$-covering net of the ball $B(W^*,r)$ to relate an arbitrary point in the ball to the points in the covering set. We can then divide the distance between $\nabla f_n(W)$ and $\nabla f(W)$ into three parts, similar to (124). Equations (147) to (149) can be derived in the same way as (126) to (128), with "$\nabla^2$" replaced by "$\nabla$". Then we need to bound $\mathbb{P}(A_t)$, $\mathbb{P}(B_t)$ and $\mathbb{P}(C_t)$ respectively, where $A_t$, $B_t$ and $C_t$ are defined below.
$$A_t=\Big\{\sup_{W\in B(W^*,r)}\frac1n\Big\|\sum_{i=1}^n\big[\nabla\ell(W;x_i)-\nabla\ell(W_{j(W)};x_i)\big]\Big\|\ge\frac t3\Big\}\quad(147)$$
$$B_t=\Big\{\sup_{W\in B(W^*,r)}\Big\|\frac1n\sum_{i=1}^n\nabla\ell(W_{j(W)};x_i)-\mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\big[\nabla\ell(W_{j(W)};x)\big]\Big\|\ge\frac t3\Big\}\quad(148)$$
$$C_t=\Big\{\sup_{W\in B(W^*,r)}\Big\|\mathbb{E}_x\big[\nabla\ell(W_{j(W)};x)\big]-\mathbb{E}_x\big[\nabla\ell(W;x)\big]\Big\|\ge\frac t3\Big\}\quad(149)$$
(a) Upper bound of $\mathbb{P}(B_t)$. Applying Lemma 3 in (Mei et al., 2018a), we have
$$\Big\|\frac1n\sum_{i=1}^n\nabla\ell(W_{j(W)};x_i)-\mathbb{E}_x\big[\nabla\ell(W_{j(W)};x)\big]\Big\| \le 2\sup_{v\in V_{\frac12}}\Big\langle\frac1n\sum_{i=1}^n\nabla\ell(W_{j(W)};x_i)-\mathbb{E}_x\big[\nabla\ell(W_{j(W)};x)\big],\,v\Big\rangle\quad(150)$$
Define $G_i=\big\langle v,\nabla\ell(W;x_i)-\mathbb{E}_x[\nabla\ell(W;x)]\big\rangle$. Here $v\in\mathbb{R}^d$. To compute $\nabla\ell(W;x_i)$, we require the derivation in Property 6. Then we can obtain an upper bound of $\zeta(W)$ in (43).
$$|\zeta(W)| = \begin{cases}\Big|-\dfrac1K\dfrac1{H(W)}\phi'(w_j^\top x)\Big| \le \dfrac{\phi(w_j^\top x)(1-\phi(w_j^\top x))}{K\cdot\frac1K\phi(w_j^\top x)} \le 1, & y=1\\[2mm] \Big|\dfrac1K\dfrac1{1-H(W)}\phi'(w_j^\top x)\Big| \le \dfrac{\phi(w_j^\top x)(1-\phi(w_j^\top x))}{K\cdot\frac1K(1-\phi(w_j^\top x))} \le 1, & y=0\end{cases}\quad(151)$$
Then we have an upper bound of $G_i$:
$$|G_i| = \big|\zeta\,v^\top x-\mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}[\zeta\,v^\top x]\big| \le |v^\top x|+\mathbb{E}_x\big[|v^\top x|\big]\quad(152)$$
Following the idea of (132) and (133), and by $v\in V_{\frac12}$, we have
$$\mathbb{E}[|G_i|^p] \le \Big(\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\Big)^{\frac p2}\quad(153)$$
$$\mathbb{E}[\exp(\theta G_i)] \le 1+|e\theta|^2\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\quad(154)$$
where (154) holds if $\theta\le\frac{27}{4e\sqrt{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}}$. Following the derivation of (130) and (135) to (138), we have
$$\mathbb{P}\Big(\Big|\frac1n\sum_{i=1}^nG_i\Big|\ge\frac t6\Big) \le 2\exp\Big(\max\Big\{-\frac{C_{14}nt^2}{144\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2},\ -\frac{C_{15}nt}{\sqrt{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}}\Big\}\Big)\quad(155)$$
for some constants $C_{14}>0$ and $C_{15}>0$. Moreover, we can obtain $\mathbb{P}(B_t)\le\frac\delta2$ as long as
$$t \ge C_{13}\max\Big\{\sqrt{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}\sqrt{\frac{dK\log\frac{18r}{\epsilon}+\log\frac4\delta}{n}},\ \sqrt{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}\cdot\frac{dK\log\frac{18r}{\epsilon}+\log\frac4\delta}{n}\Big\}\quad(156)$$
(b) For the upper bound of $\mathbb{P}(A_t)$ and $\mathbb{P}(C_t)$, we can first derive
$$\mathbb{E}_{x\sim\sum_l\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I_d)}\Big[\sup_{W\neq W'\in B(W^*,r)}\frac{\|\nabla\ell(W;x)-\nabla\ell(W';x)\|}{\|W-W'\|_F}\Big] \le \mathbb{E}_x\Big[\sup_{W\neq W'\in B(W^*,r)}\frac{|\zeta(W)-\zeta(W')|\cdot\|x\|}{\|W-W'\|_F}\Big] \le \mathbb{E}_x\Big[\sup_{W\neq W'\in B(W^*,r)}\frac{\max_{1\le j,l\le K}\{|\xi_{j,l}(\bar W)|\}\cdot\|x\|^2\sqrt K\,\|W-W'\|_F}{\|W-W'\|_F}\Big] \le \mathbb{E}_x\big[C_9\,\|x\|^2\sqrt K\big] \le 3C_9\sqrt K\,d\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\quad(157)$$
The first inequality is by (43). The second inequality is by the mean value theorem. The third step is by (93). The last inequality is by Property 5. Therefore, following the steps in part (2) of the proof of Lemma 8, we can conclude that $\mathbb{P}(C_t)=0$ if
$$t \ge 3C_9\sqrt K\,d\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\cdot\epsilon\quad(158)$$
Moreover, from (142) in Lemma 8, we have $\mathbb{P}(A_t)\le\frac\delta2$ if
$$t \ge \frac{18C_9\sqrt K\,d\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\cdot\epsilon}{\delta}\quad(159)$$

D.2 PROOF OF LEMMA 2

Following the proof of Theorem 2 in (Fu et al., 2020), first we have the Taylor expansion of $\tilde f_n(\widehat W_n)$:
$$\tilde f_n(\widehat W_n) = \tilde f_n(W^*)+\big\langle\nabla\tilde f_n(W^*),\mathrm{vec}(\widehat W_n-W^*)\big\rangle+\frac12\mathrm{vec}(\widehat W_n-W^*)^\top\nabla^2\tilde f_n(\bar W)\,\mathrm{vec}(\widehat W_n-W^*)\quad(163)$$
Here $\bar W$ is on the straight line connecting $W^*$ and $\widehat W_n$. By the fact that $\tilde f_n(\widehat W_n)\le\tilde f_n(W^*)$, we have
$$\frac12\mathrm{vec}(\widehat W_n-W^*)^\top\nabla^2\tilde f_n(\bar W)\,\mathrm{vec}(\widehat W_n-W^*) \le \|\nabla\tilde f_n(W^*)\|\cdot\|\mathrm{vec}(\widehat W_n-W^*)\|\quad(164)$$
From Lemma 7 and Lemma 9, we have
$$\|\nabla\tilde f_n(W^*)\|\cdot\|\widehat W_n-W^*\|_F \le \big(\|\nabla\tilde f_n(W^*)-\nabla\tilde f(W^*)\|+\|\nabla\tilde f(W^*)\|\big)\cdot\|\widehat W_n-W^*\|_F \le O\Big(K\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{d\log n}{n}}(1+\xi)\Big)\|\widehat W_n-W^*\|_F\quad(166)$$
The second-to-last step of (166) comes from the triangle inequality, and the last step follows from the fact that $\nabla f(W^*)=0$. Combining (164), (165) and (166), we have
$$\|\widehat W_n-W^*\|_F \le O\Bigg(\frac{K^{\frac52}\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2(1+\xi)}{\frac{1}{\eta\kappa^2}\sum_{l=1}^L\lambda_l\sigma_l^2\,\rho\big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\big)}\sqrt{\frac{d\log n}{n}}\Bigg)\quad(167)$$
Therefore, we have concluded that there indeed exists a critical point $\widehat W_n$ in $B(W^*,r)$. Then we show the linear convergence of Algorithm 1 as below. By the update rule, we have
$$W_{t+1}-\widehat W_n = W_t-\eta_0\Big(\nabla f_n(W_t)+\frac1n\sum_{i=1}^n\nu_i\Big)-\big(\widehat W_n-\eta_0\nabla f_n(\widehat W_n)\big) = \Big(I-\eta_0\int_0^1\nabla^2f_n(W(\gamma))\,d\gamma\Big)(W_t-\widehat W_n)-\frac{\eta_0}n\sum_{i=1}^n\nu_i\quad(168)$$
where $W(\gamma)=\gamma\widehat W_n+(1-\gamma)W_t$ for $\gamma\in(0,1)$. Since $W(\gamma)\in B(W^*,r)$, by Lemma 1, we have
$$H_{\min}\cdot I \preceq \nabla^2f_n(W(\gamma)) \preceq H_{\max}\cdot I\quad(169)$$
where
$$H_{\min}=\Omega\Big(\frac1{K^2}\cdot\frac1{\eta\kappa^2}\sum_{l=1}^L\lambda_l\sigma_l^2\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big)\Big),\qquad H_{\max}=C_4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\quad(170)$$
Therefore, with probability $1-d^{-10}$ we can derive
$$\|W_{t+1}-\widehat W_n\|_F \le \Big\|I-\eta_0\int_0^1\nabla^2f_n(W(\gamma))\,d\gamma\Big\|\cdot\|W_t-\widehat W_n\|_F+\frac{\eta_0}n\Big\|\sum_{i=1}^n\nu_i\Big\|_F \le (1-\eta_0H_{\min})\|W_t-\widehat W_n\|_F+\frac{\eta_0}n\Big\|\sum_{i=1}^n\nu_i\Big\|_F\quad(171)$$
$$\|W_t-\widehat W_n\|_F \le \Big(1-\frac{H_{\min}}{H_{\max}}\Big)^t\|W_0-\widehat W_n\|_F+\frac{H_{\max}\eta_0}{H_{\min}}\sqrt{\frac{dK\log n}{n}}\,\xi\quad(172)$$

E PROOF OF LEMMA 3

We need Lemma 10 to Lemma 14, which are stated in Section E.1, for the proof of Lemma 3. Section E.2 summarizes the proof of Lemma 3. The proofs of Lemma 10 to Lemma 12 are provided in Section E.3 to Section E.5.
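The linear-convergence argument above boils down to a standard contraction: when the Hessian spectrum lies in $[H_{\min},H_{\max}]$ and the step size is $1/H_{\max}$, each gradient step shrinks the distance to the minimizer by a factor of at most $1-H_{\min}/H_{\max}$. A toy sketch on a quadratic (the dimension and spectrum are illustrative assumptions, not values from the analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
Hmin, Hmax = 0.5, 4.0
H = np.diag(np.linspace(Hmin, Hmax, 6))  # toy Hessian with spectrum in [Hmin, Hmax]
w_hat = rng.standard_normal(6)           # minimizer of the quadratic
w = w_hat + rng.standard_normal(6)       # initial point

rate = 1 - Hmin / Hmax                   # per-step contraction factor
for _ in range(100):
    err = np.linalg.norm(w - w_hat)
    # Gradient step on f(w) = 0.5 * (w - w_hat)^T H (w - w_hat) with step 1/Hmax.
    w = w - (1.0 / Hmax) * (H @ (w - w_hat))
    assert np.linalg.norm(w - w_hat) <= rate * err + 1e-12

assert np.linalg.norm(w - w_hat) < 1e-3  # geometric decay after 100 steps
```

In the noisy setting of Algorithm 1, the same contraction holds up to the additive statistical error term $\frac{\eta_0}{n}\|\sum_i\nu_i\|_F$, which is why the iterates converge linearly to a neighborhood of the critical point rather than to it exactly.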
Lemma 13 and Lemma 14 are cited from (Zhong et al., 2017b). Although (Zhong et al., 2017b) considers the standard Gaussian distribution, the proofs of Lemmas 13 and 14 hold for any data distribution, so these two lemmas can be applied here directly. The tensor initialization in (Zhong et al., 2017b), however, only holds for the standard Gaussian distribution. We therefore exploit a more general definition of tensors from (Janzamin et al., 2014) for the tensor initialization in our algorithm, and we develop new error bounds for the initialization.

E.1 USEFUL LEMMAS IN THE PROOF

Lemma 10. Let $P_2$ follow Definition 1. Let $S$ be a set of i.i.d. samples generated from the Gaussian mixture distribution $\sum_{l=1}^L\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I)$, and let $\widehat P_2$ be the empirical version of $P_2$ using the data set $S$. Then with probability at least $1-2n^{-\Omega(\delta_1^4d)}$, we have
$$\|\widehat P_2-P_2\| \lesssim \sqrt{\frac{d\log n}{n}}\cdot\delta_1^2\,\tau^6\sqrt{D_2(\lambda,M,\sigma)D_4(\lambda,M,\sigma)}$$

Lemma 11. Let $U\in\mathbb{R}^{d\times K}$ be the orthogonal column span of $W^*$. Let $\alpha$ be a fixed unit vector and $\widehat U\in\mathbb{R}^{d\times K}$ denote an orthogonal matrix satisfying $\|UU^\top-\widehat U\widehat U^\top\|\le\frac14$. Define $R_3=M_3(\widehat U,\widehat U,\widehat U)$, where $M_3$ is defined in Definition 1. Let $\widehat R_3$ be the empirical version of $R_3$ using the data set $S$, where each sample of $S$ is i.i.d. drawn from the Gaussian mixture distribution $\sum_{l=1}^L\lambda_l\mathcal{N}(\mu_l,\sigma_l^2I)$. Then with probability at least $1-n^{-\Omega(\delta_1^4)}$, we have
$$\|\widehat R_3-R_3\| \lesssim \delta_1^2\,\tau^6\sqrt{D_6(\lambda,M,\sigma)}\cdot\sqrt{\frac{\log n}{n}}$$

Lemma 12. Let $\widehat M_1$ be the empirical version of $M_1$ using the dataset $S$. Then with probability at least $1-2n^{-\Omega(d)}$, we have
$$\|\widehat M_1-M_1\| \lesssim \tau^2\sqrt{D_2(\lambda,M,\sigma)}\cdot\sqrt{\frac{d\log n}{n}}\quad(176)$$

Lemma 13 ((Zhong et al., 2017b), Lemma E.6). Let $P_2$ be defined in Definition 1 and $\widehat P_2$ be its empirical version. Let $U\in\mathbb{R}^{d\times K}$ be the column span of $W^*$. Assume $\|\widehat P_2-P_2\|\le\frac{\delta_K(P_2)}{10}$. Then after $T=O(\log(\frac1\epsilon))$ iterations, the output $\widehat U$ of the tensor initialization method (Subroutine 3) will satisfy
$$\|UU^\top-\widehat U\widehat U^\top\| \lesssim \sqrt{\frac{d\log n}{n}}\cdot\frac{\delta_1^2}{\delta_K^2}\,\tau^6\sqrt{D_2(\lambda,M,\sigma)D_4(\lambda,M,\sigma)} = \sqrt{\frac{d\log n}{n}}\cdot\kappa^2\,\tau^6\sqrt{D_2(\lambda,M,\sigma)D_4(\lambda,M,\sigma)}$$
Moreover, we have
$$\|\widehat U^\top\bar w_j^*-v_j\| \le \frac{K^{\frac32}}{\delta_K^2(W^*)}\|\widehat R_3-R_3\| \lesssim \kappa^2\,\tau^6\sqrt{D_6(\lambda,M,\sigma)}\cdot\sqrt{\frac{K^3\log n}{n}}\quad(182)$$
in which the first step is by Theorem 3 in (Kuleshov et al., 2015) and the second step is by Lemma 11. By Lemma 14, we have



By "mild" we mean that given $L$, if Assumption 1 is not met for some $(\lambda_0,M_0,\sigma_0)$, there exists an infinite number of $(\lambda',M',\sigma')$ in any neighborhood of $(\lambda_0,M_0,\sigma_0)$ such that Assumption 1 holds for $(\lambda',M',\sigma')$.
$(1+\xi)\cdot\sqrt{d\log n/n}$. (13)
$\sigma_{\min}$ and $\sigma_{\max}$ denote the minimum and maximum among $\{\sigma_1,\cdots,\sigma_L\}$, respectively. $\tau=\frac{\sigma_{\max}}{\sigma_{\min}}$.
$\nabla\tilde f_n(W)$ is defined as $\frac1n\sum_{i=1}^n(\nabla\ell(W,x_i,y_i)+\nu_i)$ in Algorithm 1.



We vary $d$ and evaluate the sample complexity bound in (11) with respect to $d$. We randomly initialize $M$ times and let $W_n^{(m)}$ denote the output of Algorithm 1 in the $m$-th trial. Let $\bar W_n$ denote the mean of all $W_n^{(m)}$, and let $d_W=\sum_{m=1}^M\|W_n^{(m)}-\bar W_n\|^2/M$ denote the variance. An experiment is successful if $d_W\le10^{-4}$ and fails otherwise. $M$ is set to 20. We vary $d$ and the number of samples $n$; for each pair of $d$ and $n$, 20 independent sets of $W^*$ and the corresponding training samples are generated. Fig. 2 shows the success rate of these independent experiments. A black block means that all the experiments fail; a white block means that they all succeed. The sample complexity is indeed almost linear in $d$, as predicted by (11). Moreover, the coefficient $n/d$ can be large depending on the problem setup.
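The success criterion above, averaging $M$ trial outputs and thresholding the empirical variance $d_W$, can be sketched as follows; the trial outputs here are synthetic stand-ins for the outputs of Algorithm 1, and all numeric values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, K = 20, 10, 3
W_true = rng.standard_normal((d, K))  # stand-in for a ground-truth W*

# Simulate M near-identical trial outputs (the "successful" regime, where
# every random initialization converges to essentially the same point).
trials = [W_true + 1e-4 * rng.standard_normal((d, K)) for _ in range(M)]

# Mean output and the variance statistic d_W from the text.
W_bar = sum(trials) / M
d_W = sum(np.linalg.norm(W - W_bar) ** 2 for W in trials) / M

success = d_W <= 1e-4
assert success
```

If the trials instead converged to different critical points, the spread of the outputs around $\bar W_n$ would dominate $d_W$ and the experiment would be declared a failure, which is what produces the black blocks in Fig. 2.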

Figure 1: Comparison between gradient descent with tensor initialization and random initialization

Figure 4: (a) The convergence rate with different µ, (b) The convergence rate with different σ.

Figure 5: Convergence rate when the number of neurons K changes

The estimation error is averaged over 100 independent experiments with different $W^*$ and the corresponding training sets. $\|W^*\|_F$ is normalized to 1. The error is indeed linear in $\log(n)/n$, as predicted by (12).

C.7 PROOF OF LEMMA 8

Let $N_\epsilon$ be the $\epsilon$-covering number of the Euclidean ball $B(W^*,r)$. It is known that $\log N_\epsilon\le dK\log(\frac{3r}{\epsilon})$ from (Vershynin, 2010). Let $W_\epsilon=\{W_1,\dots,W_{N_\epsilon}\}$ be the $\epsilon$-cover set with $N_\epsilon$ elements. For any $W\in B(W^*,r)$, let $j(W)=\arg\min_{j\in[N_\epsilon]}\|W-W_j\|_F$.

(c) Final step. Let $\epsilon=\frac{\delta}{C_{13}\sqrt{\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}\cdot ndK}$, $\delta=d^{-10}$ and $t=C_{13}\,K\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{d\log n}{n}}$. If $n\ge C\,dK\log dK$ for some constant $C>0$, we have
$$\mathbb{P}\Big(\sup_{W\in B(W^*,r)}\|\nabla f_n(W)-\nabla f(W)\|\ge C_{13}\,K\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{d\log n}{n}}\Big) \le d^{-10}\quad(160)$$
By Hoeffding's inequality in (Vershynin, 2010) and Property 1, we have
$$\mathbb{P}\Big(\frac1n\Big\|\sum_{i=1}^n\nu_i\Big\|_F\ge C_{13}\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{dK\log n}{n}}\,\xi\Big) \lesssim \exp\Big(-\frac{C_{13}^2\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\,\xi^2\,dK\log n}{dK\xi^2}\Big)\quad(161)$$
Combining (146), (160) and (161), we obtain
$$\sup_{W\in B(W^*,r)}\|\nabla\tilde f_n(W)-\nabla\tilde f(W)\| \le C_{13}\,K\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2\sqrt{\frac{d\log n}{n}}(1+\xi)\quad(162)$$

$$\Omega\Big(\frac1{\eta\kappa^2}\sum_{l=1}^L\lambda_l\sigma_l^2\,\rho\Big(\frac{W^{*\top}\mu_l}{\sigma_l\delta_K(W^*)},\sigma_l\delta_K(W^*)\Big)\Big)\|\widehat W_n-W^*\|_F^2 \le \frac12\mathrm{vec}(\widehat W_n-W^*)^\top\nabla^2\tilde f_n(\bar W)\,\mathrm{vec}(\widehat W_n-W^*)\quad(165)$$
and
$$\big\langle\nabla\tilde f_n(W^*),\mathrm{vec}(\widehat W_n-W^*)\big\rangle \le \|\nabla\tilde f_n(W^*)\|\cdot\|\widehat W_n-W^*\|_F \le \big(\|\nabla\tilde f_n(W^*)-\nabla\tilde f(W^*)\|+\|\nabla\tilde f(W^*)\|\big)\cdot\|\widehat W_n-W^*\|_F\quad(166)$$

By the union bound over the failure probabilities in Lemmas 10, 11 and 12, and by $\sqrt{D_2(\lambda,M,\sigma)D_4(\lambda,M,\sigma)}\le D_6(\lambda,M,\sigma)$ from Property 7, we have that if the sample size $n\ge\kappa^8K^4\tau^{12}D_6(\lambda,M,\sigma)\cdot d\log^2d$, then the output $W_0\in\mathbb{R}^{d\times K}$ satisfies
$$\|W_0-W^*\| \lesssim \kappa^6K^3\,\tau^6\sqrt{D_6(\lambda,M,\sigma)}\sqrt{\frac{d\log n}{n}}\,\|W^*\|\quad(184)$$

Since $\eta_0\lesssim\frac{1}{C_4\sum_{l=1}^L\lambda_l(\|\mu_l\|_\infty+\sigma_l)^2}$, we obtain
$$\|W_{t+1}-\widehat W_n\|_F \le \Big(1-\frac{H_{\min}}{H_{\max}}\Big)\|W_t-\widehat W_n\|_F+\frac{\eta_0}n\Big\|\sum_{i=1}^n\nu_i\Big\|_F$$
Therefore, Algorithm 1 converges linearly to the local minimizer up to an extra statistical error. By Hoeffding's inequality in (Vershynin, 2010) and Property 1, we have

Lemma 14 ((Zhong et al., 2017b), Lemma E.13). Let $U\in\mathbb{R}^{d\times K}$ be the orthogonal column span of $W^*$. Let $\widehat U\in\mathbb{R}^{d\times K}$ be an orthogonal matrix such that $\|UU^\top-\widehat U\widehat U^\top\|\le\gamma_1\lesssim\frac{1}{\kappa^2\sqrt K}$. For each $i\in[K]$, let $v_i$ denote the vector satisfying $\|v_i-\widehat U^\top\bar w_i^*\|\le\gamma_2\lesssim\frac{1}{\kappa^2\sqrt K}$. Let $M_1$ be defined in Lemma 12 and $\widehat M_1$ be its empirical version. If $\|\widehat M_1-M_1\|\le\gamma_3\|M_1\|\le\frac14\|M_1\|$, then for each $j\in[K]$,
$$\big|\|w_j^*\|-\widehat\alpha_j\big| \lesssim \big(\kappa^4K^{\frac32}(\gamma_1+\gamma_2)+\kappa^2\sqrt K\,\gamma_3\big)\,\|w_j^*\|\quad(180)$$
By Lemma 10, Lemma 13 and the relation between $\delta_K(P_2)$ and $\delta_2$,


with probability at least $1-n^{-\Omega(\delta_1^4)}$.

E.3 PROOF OF LEMMA 10

From Assumption 1, if the Gaussian mixture model is a symmetric probability distribution as defined in (8), then $P_2=M_3(I,I,\alpha)$. Therefore, by Definition 1, we obtain (185). Following (Zhong et al., 2017b), $\tilde\otimes$ is defined such that for any $v\in\mathbb{R}^{d_1}$ and $Z\in\mathbb{R}^{d_1\times d_2}$ the stated identity holds, where $z_i$ is the $i$-th column of $Z$. By Definition 1, the indicated term is the dominant term of the entire expression, and $y\le1$. The second step is because the expression can be considered as a normalized weighted summation, where $p(x)$ is the probability density function of the random variable $x$. From Definition 1, we can verify the stated identity. Then define the corresponding centered statistic. Similar to the proof of (131), (132) and (133) in Lemma 8, and similar to the derivation of (135), we obtain (193) with probability at least $1-2n^{-\Omega(\delta_1^4d)}$.

If the Gaussian mixture model is not a symmetric distribution as defined in (8), then $P_2=M_2$, and we have a similar result. Define the corresponding centered statistic. Similar to the proof of (131), (132) and (133) in Lemma 8, and similar to the derivation of (135), we obtain (200) with probability at least $1-2n^{-\Omega(\delta_1^4d)}$. To sum up, from (193) and (200) we obtain (201) with probability at least $1-2n^{-\Omega(\delta_1^4d)}$.

E.4 PROOF OF LEMMA 11

We consider each component of $y$. We flatten $T_i(x):\mathbb{R}^d\to\mathbb{R}^{K\times K\times K}$ along the first dimension to obtain the function $B_i(x)$. Similar to the derivation of the last step of Lemma E.8 in (Zhong et al., 2017b), we can obtain $\|T_i(x)\|\le\|B_i(x)\|$, and by (185) we obtain the corresponding bound. Define $G_{r_i}$ as the centered statistic with $\|v\|=1$, so $\mathbb{E}[G_{r_i}]=0$. Similar to the proof of (131), (132) and (133) in Lemma 8, and similar to the derivation of (135), we obtain the stated tail bound for some constant $C_{18}>0$. Letting $\theta=\frac{t}{C_{18}\,\tau^6\sqrt{D_6(\lambda,M,\sigma)}}$, the claim holds with probability at least $1-2n^{-\Omega(\delta_1^4)}$.

E.5 PROOF OF LEMMA 12

From Definition 1, we obtain the stated decomposition. Based on Definition 1, define $G_{q_i}$ as the centered statistic with $\|v\|=1$, so $\mathbb{E}[G_{q_i}]=0$. Similar to the proof of (131), (132) and (133) in Lemma 8, the claimed bound holds with probability at least $1-2n^{-\Omega(d)}$.

