ANALYSIS OF ALIGNMENT PHENOMENON IN SIMPLE TEACHER-STUDENT NETWORKS WITH FINITE WIDTH

Anonymous

Abstract

Recent theoretical analyses suggest that ultra-wide neural networks trained with first-order methods always converge to global minima near their initialization. However, the convergence behavior of neural networks with finite width can be very different. Even the simplest experiment with two-layer teacher-student networks shows that the input weights of the student neurons eventually align with those of the teacher neurons. This suggests a distinct convergence mechanism for "not-too-wide" neural networks: there may be no local minima near the initialization at all. As theoretical justification, we prove that under the most basic setting, all student neurons must align with the teacher neuron at any local minimum. The methodology extends to more general cases, where the proof reduces to analyzing the properties of a special class of functions that we call Angular Distance (AD) functions. Finally, we demonstrate that these properties can be easily verified numerically.

1. INTRODUCTION

The theoretical understanding of neural network training has been a major and long-standing challenge. A recent line of research (Du et al., 2019c;b;a; Arora et al., 2019a;b) presented a major theoretical breakthrough via an elegant approach to characterizing the training of ultra-wide neural networks. At a high level, the training loss of an ultra-wide neural network (m = Ω(N^6), where m is the network width and N is the size of the training set) uniformly converges to zero. During training, almost all neurons' weight vectors remain close to their initialization, which in turn preserves the uniform convergence rate. As a result, the training loss converges to zero while almost all weight vectors stay near their initialization.

Does this elegant theory reveal the fundamental mechanics behind the success of practical neural networks, whose widths are finite (m = O(N))? The theory suggests a clear property that is critical to the uniform convergence: the weight vectors of the neurons rarely move away from their initialization. In this paper, we first examine this conjecture with experiments. Unfortunately, a simple experiment with two-layer teacher-student networks exhibits the opposite behavior: despite their randomized initialization, the weight vectors of the student neurons eventually align with the weight vector of one of the teacher neurons. In other words, almost all student weight vectors end up far away from where they began, moving from the randomized initialization to some specific teacher weight vector. We emphasize that such teacher-student experiments are generic in the sense that, by the universal approximation theorem, any target function can be equivalently described as a two-layer neural network with ReLU activation. Although such a teacher network may not be accessible for an arbitrary dataset, the teacher-student model is sufficiently representative for empirical justification.
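The alignment experiment described above can be reproduced in a few lines. The following is a minimal sketch, not the paper's actual experimental setup: the widths, learning rate, step count, and sample size are illustrative assumptions. A two-layer ReLU teacher with random unit weight vectors (and second-layer weights fixed to 1) generates labels; a wider student is trained by full-batch gradient descent; we then record the cosine similarity between each student weight vector and its nearest teacher weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m, N = 2, 3, 6, 1024          # input dim, teacher width, student width, samples

relu = lambda z: np.maximum(z, 0.0)

# Teacher: random unit weight vectors; second-layer weights fixed to 1.
W_t = rng.normal(size=(k, d))
W_t /= np.linalg.norm(W_t, axis=1, keepdims=True)

X = rng.normal(size=(N, d))          # Gaussian inputs, as in the paper's setting
y = relu(X @ W_t.T).sum(axis=1)

# Student: small random initialization, trained by full-batch gradient descent.
W_s = 0.1 * rng.normal(size=(m, d))
lr = 0.02

def loss(W):
    return 0.5 * np.mean((relu(X @ W.T).sum(axis=1) - y) ** 2)

loss_init = loss(W_s)
for _ in range(3000):
    H = X @ W_s.T                                  # pre-activations, shape (N, m)
    resid = relu(H).sum(axis=1) - y                # per-sample residual
    grad = ((H > 0) * resid[:, None]).T @ X / N    # d(loss)/d(W_s), shape (m, d)
    W_s -= lr * grad
loss_final = loss(W_s)

# Cosine similarity of each student neuron to its closest teacher neuron;
# values near 1 indicate the alignment phenomenon.
cos = (W_s / np.linalg.norm(W_s, axis=1, keepdims=True)) @ W_t.T
align = cos.max(axis=1)
```

Plotting the student weight vectors over training, as in Figure 1, shows them rotating from their random directions toward individual teacher directions.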
Similar alignment has also been observed for over-parameterized student and teacher networks with multiple layers (Tian, 2020). In other words, neuron-weight alignment may be the more appropriate fundamental mechanism for neural networks of finite width. Indeed, once the student-teacher alignment is established, it is straightforward that the training loss converges to zero. We then investigate the converse direction: does the convergence of gradient descent imply a perfect alignment between the student and the teacher neurons? To the best of our knowledge, the theoretical understanding of this question is extremely limited even for the simplest 1-teacher 2-student case. In this paper, we initiate the study of this fundamental yet highly non-trivial question in the most basic 1-teacher m-student setting, where both the teacher and the students are two-layer neural networks with ReLU activation. Mathematically, the question is equivalent to whether the set of equations defining zero gradients admits any non-alignment solutions. We give a complete proof of non-existence for m = 2, and for general m ≥ 3 we rule out solutions of two special forms: (i) all student weight vectors fall in the same half-plane; or (ii) the angles between any two weight vectors are rational multiples of π. For the general case, we show that the problem reduces to analyzing a special class of functions that we call angular-distance (AD) functions: if one of the associated matrices of the AD function has a non-positive determinant, then the set of equations has no non-trivial solution, and hence there is no local minimum without perfect alignment. In light of this reduction, we numerically verified that the minimum determinant of the associated matrices is always non-positive unless the configuration is perfectly aligned or degenerate. Moreover, the minimum determinant behaves like a potential function: the farther the configuration is from degeneracy, the farther the minimum determinant is from being non-negative.
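The easy direction of the equivalence (alignment implies zero loss and zero gradient) admits a quick numerical sanity check. The sketch below assumes unit second-layer weights, which is not stated in this section and is an illustrative choice. It builds a perfectly aligned 1-teacher m-student configuration: each student weight vector is a positive multiple of the teacher's, with scales summing to one. By positive homogeneity of ReLU, the student output matches the teacher output exactly, so the empirical loss and its gradient vanish up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, N = 4, 3, 500
relu = lambda z: np.maximum(z, 0.0)

w_teacher = rng.normal(size=d)

# Perfectly aligned students: positive multiples of the teacher direction
# whose scales sum to 1, so sum_i ReLU(c_i * w^T x) = ReLU(w^T x) exactly.
scales = rng.random(m)
scales /= scales.sum()
W_s = scales[:, None] * w_teacher[None, :]

X = rng.normal(size=(N, d))                            # Gaussian inputs
resid = relu(X @ W_s.T).sum(axis=1) - relu(X @ w_teacher)
loss = 0.5 * np.mean(resid ** 2)
grad = ((X @ W_s.T > 0) * resid[:, None]).T @ X / N    # gradient w.r.t. W_s
```

The hard direction, that every critical point must be of this aligned form, is precisely what the reduction to AD functions addresses.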

1.1. RELATED WORKS

Recent papers have made tremendous progress on the theoretical analysis of sufficiently over-parameterized neural networks, or those with infinite width. Under mild assumptions, it has been proved that the training error uniformly converges to zero under gradient descent (Allen-Zhu et al., 2019b; Du et al., 2018b; 2019c;b; Du and Hu, 2019; Lee et al., 2019; Li and Liang, 2018; Zou et al., 2018). Surprisingly, such uniform convergence implies that global minima always exist near a random initialization, and generalization bounds also emerge under the same frameworks (Allen-Zhu et al., 2019a; Arora et al., 2019a; Cao and Gu, 2019). Moreover, as the network width approaches infinity, fundamental connections between over-parameterized neural networks and kernel methods have been discovered, hinting at a potential correspondence between the generalization of deep learning and that of kernel methods (Arora et al., 2019b; Daniely et al., 2016; Daniely, 2017; Du et al., 2019a; Hazan and Jaakkola, 2015; Jacot et al., 2018; Mairal et al., 2014; Yang, 2019). Despite the beauty of these theoretical results, empirical studies exhibit certain mismatches with the theory. Nagarajan and Kolter (2019) show that the uniform convergence property may be insufficient to explain the generalization ability of neural networks. Moreover, Tian (2020) uses teacher networks to train over-parameterized student networks of the same depth and demonstrates that almost all student neurons either align with some teacher neuron or have no contribution to the final output. Our theoretical results derive from analyzing the behaviour of gradient descent under the assumption that the input distribution is Gaussian, following the methodology of many recent works (Brutzkus and Globerson, 2017; Du et al., 2017; Du and Lee, 2018; Du et al., 2018a;c; Li and Yuan, 2017; Tian, 2017; Zhong et al., 2017).
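The Gaussian-input assumption is what makes this line of analysis tractable: the population correlation between two ReLU neurons has a closed form depending only on the weight norms and the angle θ between the weight vectors, namely E[ReLU(wᵀx) ReLU(vᵀx)] = (‖w‖‖v‖ / 2π)(sin θ + (π − θ) cos θ) for x ~ N(0, I). This is the degree-1 arc-cosine kernel (Cho and Saul, 2009). The sketch below verifies the identity by Monte Carlo; the dimension and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 5, 200_000
relu = lambda z: np.maximum(z, 0.0)

w, v = rng.normal(size=d), rng.normal(size=d)
nw, nv = np.linalg.norm(w), np.linalg.norm(v)
theta = np.arccos(np.clip(w @ v / (nw * nv), -1.0, 1.0))

# Closed form: arc-cosine kernel of degree 1.
closed_form = nw * nv / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

# Monte Carlo estimate over Gaussian inputs.
X = rng.normal(size=(N, d))
mc = np.mean(relu(X @ w) * relu(X @ v))
```

Such closed forms let the population gradient be written purely in terms of norms and pairwise angles, which is the starting point for the angular analysis in this paper.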



Figure 1: The weight vectors of teacher neurons (dotted lines) and student neurons (solid lines) at epochs 0, 10, 1000, 5000, and 50000, respectively.

