LEARNING ONE-HIDDEN-LAYER NEURAL NETWORKS ON GAUSSIAN MIXTURE MODELS WITH GUARANTEED GENERALIZABILITY

Abstract

We analyze the learning problem of fully connected neural networks with the sigmoid activation function for binary classification in the teacher-student setup, where the outputs are assumed to be generated by a ground-truth teacher neural network with unknown parameters, and the learning objective is to estimate the teacher network model by minimizing a non-convex cross-entropy risk function of the training data over a student neural network. This paper analyzes a general and practical scenario in which the input features follow a Gaussian mixture model of a finite number of Gaussian distributions with various means and variances. We propose a gradient descent algorithm with a tensor initialization approach and show that our algorithm converges linearly to a critical point that has a diminishing distance to the ground-truth model with guaranteed generalizability. We characterize the required number of samples for successful convergence, referred to as the sample complexity, as a function of the parameters of the Gaussian mixture model. We prove analytically that when any mean or variance in the mixture model is large, or when all variances are close to zero, the sample complexity increases and the convergence slows down, indicating a more challenging learning problem. Although focusing on one-hidden-layer neural networks, to the best of our knowledge, this paper provides the first explicit characterization of the impact of the parameters of the input distributions on the sample complexity and learning rate.

1. INTRODUCTION

Deep neural networks (LeCun et al., 2015) have demonstrated superior empirical performance in various applications such as computer vision (Krizhevsky et al., 2012; He et al., 2016) and speech recognition (Graves et al., 2013). Despite the numerical success, the theoretical underpinning of learning neural networks is much less investigated. One bottleneck for the wide acceptance of deep learning in critical applications is the lack of theoretical generalization guarantees, i.e., explanations of why a model learned from the training data would achieve a high accuracy on the testing data. This paper studies the generalization performance of neural networks in the "teacher-student" setup, where the training data are generated by a teacher neural network, and the learning is performed on a student network by minimizing the empirical risk of the training data. This teacher-student setup has been studied in the statistical learning community for a long time (Engel & Broeck, 2001; Seung et al., 1992) and applied to neural networks recently (Goldt et al., 2019a; Zhong et al., 2017a;b; Zhang et al., 2019; 2020a;b; Fu et al., 2020). Assuming that the student network has the same architecture as the teacher network, the existing generalization analyses mostly focus on one-hidden-layer networks, because the optimization problem is already non-convex, and the analytical complexity increases tremendously when the number of hidden layers increases. One critical assumption of most works in this line is that the input features follow the standard Gaussian distribution. Although other distributions are considered in (Du et al., 2017; Ghorbani et al., 2020; Goldt et al., 2019b; Li & Liang, 2018; Mei et al., 2018b; Mignacco et al., 2020; Yoshida & Okada, 2019), the generalization performance beyond the standard Gaussian input is less investigated. On the other hand, the learning performance clearly depends on the input data distribution.
For example, LeCun et al. (1998) state that the learning method converges faster if the inputs are whitened to

