ANALYSIS OF ALIGNMENT PHENOMENON IN SIMPLE TEACHER-STUDENT NETWORKS WITH FINITE WIDTH

Anonymous

Abstract

Recent theoretical analyses suggest that ultra-wide neural networks trained with first-order methods always converge to global minima near their initialization. However, the convergence behavior of finite-width neural networks can be very different. The simplest experiment with two-layer teacher-student networks shows that the input weights of the student neurons eventually align with one of the teacher neurons. This suggests a distinct convergence mechanism for "not-too-wide" neural networks: there may be no local minima near the initialization. As theoretical justification, we prove that under the most basic setting, all student neurons must align with the teacher neuron at any local minimum. The methodology extends to more general cases, where the proof reduces to analyzing the properties of a special class of functions that we call Angular Distance (AD) functions. Finally, we demonstrate that these properties can easily be verified numerically.

1. INTRODUCTION

A theoretical understanding of neural network training has been a major and long-standing challenge. A recent line of research (Du et al., 2019c;b;a; Arora et al., 2019a;b) presented a major theoretical breakthrough via an elegant approach to characterizing the training of ultra-wide neural networks. At a high level, the training loss of an ultra-wide neural network (m = Ω(N^6), where N is the size of the training set) converges uniformly to zero. During training, almost all neurons' weight vectors remain close to their initialization, which in turn preserves the uniform convergence rate. As a result, the training loss converges to zero while almost all weight vectors stay near their initialization.

Does this elegant theory reveal the fundamental mechanics behind the success of practical neural networks, whose widths are finite (m = O(N))? The theory identifies a property that is critical to the uniform convergence: the weight vectors of the neurons rarely move far from their initialization. In this paper, we first examine this conjecture experimentally. Unfortunately, a simple experiment with two-layer teacher-student networks exhibits the opposite behavior: despite their randomized initialization, the weight vectors of the student neurons eventually align with the weight vector of one of the teacher neurons. In other words, almost all student weight vectors end up far from where they began, moving from a random initialization to some specific teacher weight vector. We emphasize that such teacher-student experiments are generic in the sense that, by the universal approximation theorem, any target function can be equivalently described as a two-layer neural network with ReLU activation. Although such a teacher network may not be accessible for an arbitrary dataset, the teacher-student model is sufficiently representative for empirical justification.
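The experiment described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual experimental setup: the input dimension, student width, sample size, learning rate, and the choice of a single teacher neuron with output weights fixed to 1 are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, N = 5, 4, 256                 # input dim, student width, training-set size
X = rng.standard_normal((N, d))

# Teacher: a single ReLU neuron with a unit-norm weight vector w*.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
y = np.maximum(X @ w_star, 0.0)

# Student: m ReLU neurons (output weights fixed to 1), small random init.
W = 0.1 * rng.standard_normal((m, d))

def loss_and_grad(W):
    H = X @ W.T                          # pre-activations, shape (N, m)
    r = np.maximum(H, 0.0).sum(1) - y    # residual of the summed student output
    G = ((H > 0) * r[:, None]).T @ X / N # dL/dW, shape (m, d)
    return 0.5 * np.mean(r ** 2), G

loss0, _ = loss_and_grad(W)
for _ in range(3000):                    # plain gradient descent
    loss, G = loss_and_grad(W)
    W -= 0.1 * G

# Cosine similarity between each student weight vector and the teacher's.
cos = (W @ w_star) / np.linalg.norm(W, axis=1)
print(f"loss: {loss0:.4f} -> {loss:.6f}, cosines: {np.round(cos, 3)}")
```

In runs of this sketch the loss drops by orders of magnitude and the cosine similarities move toward 1, i.e., the student weight vectors rotate away from their random initialization and toward the teacher direction, matching the alignment phenomenon discussed in the text.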
Similar alignments have also been observed for over-parameterized student and teacher networks with multiple layers (Tian, 2020). In other words, neuron-weight alignment may be the more appropriate fundamental mechanics for neural networks with finite width. In fact, once the student-teacher alignment has been established, it is straightforward to see that the training loss converges to zero. We then investigate the other direction of the observation: Does the convergence of gradient descent in this case imply a perfect alignment between the student and the teacher neurons? To the best
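The claim that alignment implies zero training loss follows from the positive homogeneity of ReLU: if each student weight vector is a positive multiple of some teacher weight vector, and the multiples attached to each teacher neuron sum to 1, the two networks compute identical outputs. A small numerical check of this (with hypothetical dimensions and output weights fixed to 1):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 6, 1000
X = rng.standard_normal((N, d))

# Teacher: two ReLU neurons.
W_t = rng.standard_normal((2, d))
teacher = np.maximum(X @ W_t.T, 0.0).sum(1)

# Student: four neurons, each a positive multiple of a teacher neuron.
# Neurons 0,1 split teacher neuron 0 (coefficients sum to 1); likewise 2,3.
c = np.array([0.3, 0.7, 0.5, 0.5])
W_s = c[:, None] * W_t[[0, 0, 1, 1]]
student = np.maximum(X @ W_s.T, 0.0).sum(1)

# ReLU(c * w.x) = c * ReLU(w.x) for c > 0, so the outputs coincide.
loss = 0.5 * np.mean((student - teacher) ** 2)
print(loss)
```

The printed loss is zero up to floating-point error, illustrating why establishing alignment immediately yields convergence of the training loss.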

