LEARNING ONE-HIDDEN-LAYER NEURAL NETWORKS ON GAUSSIAN MIXTURE MODELS WITH GUARANTEED GENERALIZABILITY

Abstract

We analyze the learning problem of fully connected neural networks with the sigmoid activation function for binary classification in the teacher-student setup, where the outputs are assumed to be generated by a ground-truth teacher neural network with unknown parameters, and the learning objective is to estimate the teacher network by minimizing a non-convex cross-entropy risk function of the training data over a student neural network. This paper analyzes a general and practical scenario in which the input features follow a Gaussian mixture model with a finite number of Gaussian components with different means and variances. We propose a gradient descent algorithm with a tensor initialization approach and show that our algorithm converges linearly to a critical point that has a diminishing distance to the ground-truth model with guaranteed generalizability. We characterize the required number of samples for successful convergence, referred to as the sample complexity, as a function of the parameters of the Gaussian mixture model. We prove analytically that when any mean or variance in the mixture model is large, or when all variances are close to zero, the sample complexity increases and the convergence slows down, indicating a more challenging learning problem. Although focusing on one-hidden-layer neural networks, to the best of our knowledge, this paper provides the first explicit characterization of the impact of the parameters of the input distribution on the sample complexity and learning rate.

1. INTRODUCTION

Deep neural networks (LeCun et al., 2015) have demonstrated superior empirical performance in various applications such as speech recognition (Graves et al., 2013) and computer vision (Krizhevsky et al., 2012; He et al., 2016). Despite this numerical success, the theoretical underpinnings of learning neural networks are much less investigated. One bottleneck for the wide acceptance of deep learning in critical applications is the lack of theoretical generalization guarantees, i.e., explanations of why a model learned from the training data would achieve a high accuracy on the testing data. This paper studies the generalization performance of neural networks in the "teacher-student" setup, where the training data are generated by a teacher neural network, and the learning is performed on a student network by minimizing the empirical risk of the training data. This teacher-student setup has long been studied in the statistical learning community (Engel & Broeck, 2001; Seung et al., 1992) and has recently been applied to neural networks (Goldt et al., 2019a; Zhong et al., 2017b;a; Zhang et al., 2019; 2020b; Fu et al., 2020; Zhang et al., 2020a). Assuming that the student network has the same architecture as the teacher network, existing generalization analyses mostly focus on one-hidden-layer networks, because the optimization problem is already nonconvex, and the analytical complexity increases tremendously with the number of hidden layers. One critical assumption of most works in this line is that the input features follow the standard Gaussian distribution. Although other distributions are considered in (Du et al., 2017; Ghorbani et al., 2020; Goldt et al., 2019b; Li & Liang, 2018; Mei et al., 2018b; Mignacco et al., 2020; Yoshida & Okada, 2019), the generalization performance beyond the standard Gaussian input is less investigated. On the other hand, the learning performance clearly depends on the input data distribution.
(LeCun et al., 1998) states that learning methods converge faster if the inputs are whitened to the standard Gaussian. Batch normalization (Ioffe & Szegedy, 2015), which modifies the mean and variance in each layer, is a popular practical method to achieve fast and stable convergence. Various explanations, e.g., (Bjorck et al., 2018; Chai et al., 2020; Santurkar et al., 2018), have been proposed for the enormous success of batch normalization, but little consensus exists on the exact mechanism.

Contributions: This paper provides a theoretical analysis of learning one-hidden-layer neural networks when the input distribution follows a Gaussian mixture model containing an arbitrary number of Gaussian components with arbitrary means and variances. The Gaussian mixture model has been employed in many applications such as data clustering and unsupervised learning (Dasgupta, 1999; Figueiredo & Jain, 2002; Jain, 2010), and image classification and segmentation (Permuter et al., 2006). The parameters of the mixture model can be estimated from data by the EM algorithm (Redner & Walker, 1984) or the moment-based method (Hsu & Kakade, 2013), with theoretical performance guarantees; see, e.g., (Ho & Nguyen, 2016; Ho et al., 2020; Dwivedi et al., 2020a;b). For the binary classification problem with the cross-entropy loss function, this paper proposes a gradient descent algorithm with tensor initialization to estimate the weights of a one-hidden-layer fully connected neural network. Our algorithm converges linearly to a critical point, and the returned critical point converges to the ground-truth model at a rate of d log n/n, where d is the dimension of the features, and n is the number of samples. We also characterize the required number of samples for accurate estimation, referred to as the sample complexity, as a function of d, the number of neurons K, and the input distribution.
Our explicit bounds imply that (1) when the absolute value of any mean in the Gaussian mixture model increases from zero, the sample complexity increases, and the algorithm converges more slowly, indicating that it is more challenging to learn a model with a small test error; (2) the same phenomenon occurs when any variance in the mixture model increases to infinity from a certain positive value, or when all the variances in the mixture model approach zero. Our results indicate that training converges faster and requires fewer samples if the input data are zero mean with a certain non-zero variance. This can be viewed as a theoretical explanation, in the one-hidden-layer setting, for the success of batch normalization. Moreover, to the best of our knowledge, this paper provides the first theoretical and explicit characterization of how the mean and variance of the input distribution affect the sample complexity and learning rate.
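The teacher-student setup described above can be sketched in code. The following is a minimal illustration, not the paper's algorithm: the mixture parameters, network sizes, and learning rate are arbitrary choices, and a small random perturbation of the teacher weights stands in for the tensor initialization. Inputs are drawn from a two-component Gaussian mixture, binary labels are generated by a teacher network of K sigmoid units, and a student of the same architecture is trained by gradient descent on the empirical cross-entropy risk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): feature dimension d,
# hidden neurons K, training samples n.
d, K, n = 10, 3, 2000

# Gaussian mixture input: two components with different means and variances.
means = np.stack([np.zeros(d), 0.5 * np.ones(d)])
stds = np.array([1.0, 0.5])
comp = rng.integers(0, 2, size=n)
X = means[comp] + stds[comp, None] * rng.standard_normal((n, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def risk(q, y):
    # Empirical cross-entropy risk, clipped for numerical stability.
    q = np.clip(q, 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

# Teacher network: one hidden layer of K sigmoid units, averaged output.
W_star = rng.standard_normal((d, K)) / np.sqrt(d)
p = sigmoid(X @ W_star).mean(axis=1)
y = (rng.random(n) < p).astype(float)  # labels generated by the teacher

# Student network: same architecture; a perturbation of the teacher
# weights stands in for the paper's tensor initialization.
W = W_star + 0.2 * rng.standard_normal((d, K)) / np.sqrt(d)
risk_init = risk(sigmoid(X @ W).mean(axis=1), y)

lr = 0.5
for _ in range(300):
    H = sigmoid(X @ W)                             # n x K hidden activations
    q = np.clip(H.mean(axis=1), 1e-6, 1 - 1e-6)    # student output probability
    r = (q - y) / (q * (1 - q))                    # d(cross-entropy)/dq
    G = X.T @ ((r[:, None] * H * (1 - H)) / K) / n  # gradient of the risk in W
    W -= lr * G

risk_final = risk(sigmoid(X @ W).mean(axis=1), y)
print(risk_init, risk_final)
```

Starting from a point close to the ground truth, gradient descent decreases the empirical risk; the paper's analysis quantifies how the mixture means and variances govern the number of samples n needed for the iterates to approach W_star.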

1.1. RELATED WORK

Learning over-parameterized neural networks. One line of theoretical research on learning performance considers the over-parameterized setting, where the number of network parameters exceeds the number of training samples (Bousquet & Elisseeff, 2002; Hardt et al., 2016; Keskar et al., 2016; Livni et al., 2014; Neyshabur et al., 2017; Rumelhart et al., 1988; Soltanolkotabi et al., 2018; Allen-Zhu et al., 2019a). (Allen-Zhu et al., 2019b; Du et al., 2019; Zou & Gu, 2019) show that deep neural networks can fit all training samples in polynomial time. The optimization problem has no spurious local minima (Livni et al., 2014; Zhang et al., 2016; Soltanolkotabi et al., 2018), and the global minimum of the empirical risk function can be obtained by gradient descent (Li & Yuan, 2017; Du et al., 2018b; Zou et al., 2020). Although the returned model can achieve a zero training error, these works do not discuss whether it achieves a small test error. (Allen-Zhu et al., 2019a; Li & Liang, 2018) analyze the generalization error by characterizing the training error and the test error separately; still, there is no guarantee that a learned model with a small training error would have a small test error. (Cao & Gu, 2019) bounds the generalization error of a model learned by stochastic gradient descent (SGD) in deep neural networks, under the assumption that a good model with a small test error exists near the initialization of the SGD algorithm, but provides no discussion of how to find such an initialization. In contrast, our tensor initialization method provides an initialization that is close to the ground-truth teacher model, so that our algorithm can find this model with a zero test error.

Generalization performance with the standard Gaussian input.
In the teacher-student setup of one-hidden-layer neural networks, (Brutzkus & Globerson, 2017; Du et al., 2018a; Ge et al., 2018; Liang et al., 2018; Li & Yuan, 2017; Shamir, 2018; Safran & Shamir, 2018; Tian, 2017) consider the ideal case of an infinite number of training samples so that the training and test accuracy coincide and can be analyzed simultaneously. When the number of training samples is finite, (Zhong et al., 2017b; a) characterize the sample complexity, i.e., the required number of samples, of learning one-hidden-layer fully connected neural networks with smooth activation functions and propose a

