ON ALIGNMENT IN DEEP LINEAR NEURAL NETWORKS

Abstract

We study the properties of alignment, a form of implicit regularization, in linear neural networks under gradient descent. We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1-dimensional outputs as defined by Ji and Telgarsky (2018). While fully connected networks always admit a global minimum corresponding to an aligned solution, we analyze alignment as it relates to the training process. Namely, we characterize when alignment is an invariant of training under gradient descent by providing necessary and sufficient conditions for this invariant to hold. In such settings, the dynamics of gradient descent simplify, allowing us to provide an explicit learning rate under which the network converges linearly to a global minimum. We then analyze networks with layer constraints, such as convolutional networks. In this setting, we prove that gradient descent is equivalent to projected gradient descent and that alignment is impossible with sufficiently large datasets.

1. INTRODUCTION

Although overparameterized deep networks can interpolate randomly labeled training data (Du et al., 2019; Wu et al., 2019), training overparameterized networks with modern optimizers often leads to solutions that generalize well. This suggests that a form of implicit regularization occurs through training (Zhang et al., 2017). As an example of implicit regularization, Ji & Telgarsky (2018) proved that the layers of linear neural networks used for binary classification on linearly separable datasets become aligned in the limit of training. That is, for a linear network parameterized by the matrix product W_d W_{d-1} ... W_1, the top left/right singular vectors u_i and v_i of layer W_i satisfy |v_{i+1}^T u_i| → 1 as the number of gradient descent steps goes to infinity. Alignment of singular vector spaces between adjacent layers allows the network representation to be drastically simplified (see Equation 3); namely, the product of all layers becomes a product of diagonal matrices, with the exception of the outermost unitary matrices. If alignment is an invariant of training, then optimization over the set of weight matrices reduces to optimization over the set of singular values of the weight matrices. Thus, importantly, alignment of singular vector spaces allows the gradient descent update rule to be simplified significantly, which Ji & Telgarsky (2018) used to show convergence to a max-margin solution.

In this work, we generalize the definition of alignment to the multidimensional setting. We study when alignment can occur and, moreover, under which conditions it is an invariant of training in linear neural networks under gradient descent. Prior works (Gidel et al., 2019; Saxe et al., 2014; 2019) have implicitly relied on invariance of alignment as an assumption on initialization to simplify training dynamics for 2-layer networks.
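As a concrete illustration (not taken from the paper), the alignment quantity |v_{i+1}^T u_i| can be measured numerically for a small linear network trained by gradient descent on a least-squares target. The network width, data, initialization scale, and learning rate below are arbitrary choices for this sketch, not values prescribed by the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a 3-layer linear network W3 W2 W1 trained by
# gradient descent on least-squares regression against a linear teacher.
d, n = 10, 100
layers = [0.3 * rng.standard_normal((d, d)) for _ in range(3)]  # W1, W2, W3
X = rng.standard_normal((d, n))
Y = rng.standard_normal((d, d)) @ X  # targets from a random linear teacher

def loss():
    P = layers[2] @ layers[1] @ layers[0]
    return 0.5 * np.sum((P @ X - Y) ** 2) / n

initial_loss = loss()
lr = 0.01
for _ in range(3000):
    P = layers[2] @ layers[1] @ layers[0]
    R = (P @ X - Y) @ X.T / n  # gradient of the loss w.r.t. the product P
    grads = [
        (layers[2] @ layers[1]).T @ R,  # dL/dW1
        layers[2].T @ R @ layers[0].T,  # dL/dW2
        R @ (layers[1] @ layers[0]).T,  # dL/dW3
    ]
    for W, G in zip(layers, grads):
        W -= lr * G
final_loss = loss()

def alignment(W_next, W_prev):
    """|v_{i+1}^T u_i|: overlap between the top right singular vector of the
    deeper layer and the top left singular vector of the shallower layer."""
    _, _, Vt_next = np.linalg.svd(W_next)
    U_prev, _, _ = np.linalg.svd(W_prev)
    return abs(Vt_next[0] @ U_prev[:, 0])

alignments = [alignment(layers[i + 1], layers[i]) for i in range(2)]
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}, alignments: {alignments}")
```

Note that with multidimensional outputs and generic random initialization, alignment is not guaranteed to emerge, which is precisely the distinction between the 1-dimensional setting of Ji & Telgarsky (2018) and the conditions characterized in this work.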
In this work, we provide necessary and sufficient conditions for when alignment is an invariant for networks of arbitrary depth. Our main contributions are as follows:

1. We extend the definition of alignment from the 1-dimensional classification setting to the multidimensional setting (Definition 2) and characterize when alignment is an invariant of training in linear fully connected networks with multidimensional outputs (Theorem 1).

2. We demonstrate that alignment is an invariant for fully connected networks with multidimensional outputs only in special problem classes including autoencoding, matrix factorization

