ON ALIGNMENT IN DEEP LINEAR NEURAL NETWORKS

Abstract

We study the properties of alignment, a form of implicit regularization, in linear neural networks under gradient descent. We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1-dimensional outputs as defined by Ji & Telgarsky (2018). While fully connected networks always admit a global minimum corresponding to an aligned solution, we analyze alignment as a property of the training process. Namely, we characterize when alignment is an invariant of training under gradient descent by providing necessary and sufficient conditions for this invariant to hold. In such settings, the dynamics of gradient descent simplify, thereby allowing us to provide an explicit learning rate under which the network converges linearly to a global minimum. We then analyze networks with layer constraints, such as convolutional networks. In this setting, we prove that gradient descent is equivalent to projected gradient descent, and that alignment is impossible with sufficiently large datasets.

1. INTRODUCTION

Although overparameterized deep networks can interpolate randomly labeled training data (Du et al., 2019; Wu et al., 2019), training overparameterized networks with modern optimizers often leads to solutions that generalize well. This suggests that a form of implicit regularization occurs during training (Zhang et al., 2017). As an example of implicit regularization, Ji & Telgarsky (2018) proved that the layers of linear neural networks used for binary classification on linearly separable datasets become aligned in the limit of training. That is, for a linear network parameterized by the matrix product $W_d W_{d-1} \cdots W_1$, the top left and right singular vectors $u_i$ and $v_i$ of layer $W_i$ satisfy $|v_{i+1}^T u_i| \to 1$ as the number of gradient descent steps goes to infinity. Alignment of singular vector spaces between adjacent layers drastically simplifies the network representation (see Equation 3): the product of all layers becomes a product of diagonal matrices, with the exception of the outermost unitary matrices. If alignment is an invariant of training, then optimization over the set of weight matrices reduces to optimization over the set of singular values of the weight matrices. Thus, importantly, alignment of singular vector spaces significantly simplifies the gradient descent update rule, which Ji & Telgarsky (2018) used to show convergence to a max-margin solution. In this work, we generalize the definition of alignment to the multidimensional setting. We study when alignment can occur and, moreover, under which conditions it is an invariant of training in linear neural networks under gradient descent. Prior works (Gidel et al., 2019; Saxe et al., 2014; 2019) have implicitly relied on invariance of alignment as an assumption on initialization to simplify the training dynamics of 2-layer networks.
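The collapse of the network representation under alignment can be seen numerically. The following NumPy sketch (our own illustrative construction, not code from the paper) builds a depth-3 linear network whose layers share singular vector spaces, checks that $|v_{i+1}^T u_i| = 1$ for every adjacent pair, and verifies that the end-to-end product reduces to a product of diagonal matrices framed by the two outermost orthogonal matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 3, 4

# Chained orthogonal factors O_0, ..., O_depth (via QR decomposition)
O = [np.linalg.qr(rng.standard_normal((width, width)))[0]
     for _ in range(depth + 1)]
# Distinct, decreasing positive singular values so top singular vectors are unique
S = [np.diag(np.sort(rng.uniform(1.0, 2.0, width))[::-1]) for _ in range(depth)]

# Aligned layers: W_i = O_{i+1} S_i O_i^T, so the right singular vectors of
# layer i+1 coincide (up to sign) with the left singular vectors of layer i
W = [O[i + 1] @ S[i] @ O[i].T for i in range(depth)]

for i in range(depth - 1):
    U_i, _, _ = np.linalg.svd(W[i])
    _, _, Vt_next = np.linalg.svd(W[i + 1])
    assert abs(Vt_next[0] @ U_i[:, 0]) > 1 - 1e-10  # |v_{i+1}^T u_i| = 1

# The end-to-end map collapses to diagonal products between the outermost factors
product = W[2] @ W[1] @ W[0]
assert np.allclose(product, O[3] @ S[2] @ S[1] @ S[0] @ O[0].T)
print("aligned network: W_3 W_2 W_1 = O_3 (S_2 S_1 S_0) O_0^T")
```

If alignment is preserved by training, only the diagonal factors change, which is what reduces optimization over weight matrices to optimization over singular values.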
In this work, we provide necessary and sufficient conditions for when alignment is an invariant for networks of arbitrary depth. Our main contributions are as follows:

1. We extend the definition of alignment from the 1-dimensional classification setting to the multidimensional setting (Definition 2) and characterize when alignment is an invariant of training in linear fully connected networks with multidimensional outputs (Theorem 1).

2. We demonstrate that alignment is an invariant for fully connected networks with multidimensional outputs only in special problem classes, including autoencoding, matrix factorization, and matrix sensing. This is in contrast to networks with 1-dimensional outputs, where there exists an initialization such that adjacent layers remain aligned throughout training under any real-valued loss function and any training dataset.

3. Alignment largely simplifies the analysis of training linear networks: we provide an explicit learning rate under which gradient descent converges linearly to a global minimum under alignment in the squared loss setting (Proposition 1).

4. We prove that alignment cannot occur, let alone be invariant, in networks with constrained layer structure (such as convolutional networks) when the amount of training data dominates the dimension of the layer structure (Theorem 3).

5. We support our theoretical findings via experiments in Section 6.

As a consequence, our characterization of the invariance properties of alignment provides settings under which the gradient descent dynamics can be simplified and the implicit regularization properties can be fully understood, yet it also shows that further results are required to explain implicit regularization in linear neural networks more generally.

2. RELATED WORK

Implicit regularization in overparameterized networks has become a subject of significant interest (Gunasekar et al., 2018a; b; Martin & Mahoney, 2018; Neyshabur et al., 2014). In order to characterize the specific form of implicit regularization, several works have focused on analyzing deep linear networks (Arora et al., 2019b; Gunasekar et al., 2018b; 2017; Soudry et al., 2018). Even though such networks can only express linear maps, parameter optimization in linear networks is non-convex and is studied in order to obtain intuition about optimization of deep networks more generally. One such form of implicit regularization is alignment, identified by Ji & Telgarsky (2018) in the analysis of linear fully connected networks with 1-dimensional outputs trained on linearly separable data. They proved that in the limit of training, each layer, after normalization, approaches a rank-1 matrix, i.e., $\lim_{t \to \infty} \frac{W_i^{(t)}}{\|W_i^{(t)}\|_F} = u_i v_i^T$, and that adjacent layers $W_{i+1}$ and $W_i$ become aligned, i.e., $|v_{i+1}^T u_i| \to 1$. In addition, Ji & Telgarsky (2018) proved that alignment in this setting occurs concurrently with convergence to the max-margin solution. Follow-up work mainly focused on this convergence phenomenon and gave explicit convergence rates for overparameterized networks trained with gradient descent (Arora et al., 2019c; Zou et al., 2018). Our definition of invariance of alignment extends assumptions on initialization appearing in various prior works (Gidel et al., 2019; Saxe et al., 2014; 2019). While the connection to alignment was not mentioned in their work, Gidel et al. (2019) begin to generalize alignment to multidimensional outputs by considering two-layer networks initialized so that the layers are aligned with each other and with the data. We generalize this to networks of any depth, showing that our definition of alignment corresponds to the initialization considered in Gidel et al. (2019).
Moreover, we establish necessary and sufficient conditions for when alignment is an invariant of training in Theorem 1, instead of assuming these conditions. Furthermore, their result on sequential learning of components can be derived via our singular value update rule in Corollary 1. Balancedness is another closely related form of implicit regularization in linear neural networks. It was introduced in Arora et al. (2018) and defined as the property that if $W_{i+1}^T W_{i+1} = W_i W_i^T$ for all $i$ at initialization, then this property is invariant under gradient flow. Du et al. (2018) present a more general form: $W_{i+1}^T W_{i+1} - W_i W_i^T$ is constant under gradient flow. In practice, analyses rely on this quantity being close to or exactly zero. In this exact setting, balancedness indeed implies alignment of the singular vector spaces of consecutive layers. To study gradient descent, slightly more general notions such as approximate balancedness (Arora et al., 2019a) and $\epsilon$-balancedness have been introduced. Du et al. (2018) also defined balancedness with respect to convolutional networks, showing that under gradient flow, the difference in the norms of the weights of consecutive layers is an invariant. Generally, the goal of identifying invariants of training such as balancedness or alignment is to help understand both the dynamics of training and the properties of solutions at the end of training.
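The conserved quantity of Du et al. (2018) can be observed directly. The following NumPy sketch (our own illustration on an arbitrary random regression problem, not code from the paper) trains a 2-layer linear network by gradient descent on the squared loss and tracks the drift of $W_2^T W_2 - W_1 W_1^T$ from its value at initialization; the quantity is exactly conserved under gradient flow, and with a small step size it drifts only at order $\eta^2$ per step:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 4, 3, 20  # input dim, output dim, number of samples
X = rng.standard_normal((n, p))
Y = rng.standard_normal((m, p))

W1 = 0.1 * rng.standard_normal((n, n))
W2 = 0.1 * rng.standard_normal((m, n))
balance0 = W2.T @ W2 - W1 @ W1.T  # conserved quantity at initialization

lr = 1e-3
for _ in range(2000):
    R = W2 @ W1 @ X - Y      # residual of the squared loss 0.5 * ||W2 W1 X - Y||_F^2
    g1 = W2.T @ R @ X.T      # dL/dW1
    g2 = R @ X.T @ W1.T      # dL/dW2
    W1 -= lr * g1
    W2 -= lr * g2

drift = np.linalg.norm(W2.T @ W2 - W1 @ W1.T - balance0)
print(drift)  # remains small: first-order terms in the update cancel exactly
```

Expanding one discrete update shows why: the first-order (in the learning rate) changes to $W_2^T W_2$ and $W_1 W_1^T$ cancel exactly, leaving only an $O(\eta^2)$ residual per step, which is why balancedness at initialization is approximately preserved by gradient descent with a small learning rate.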

