ON ALIGNMENT IN DEEP LINEAR NEURAL NETWORKS

Abstract

We study the properties of alignment, a form of implicit regularization, in linear neural networks under gradient descent. We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1-dimensional outputs as defined by Ji and Telgarsky, 2018. While in fully connected networks, there always exists a global minimum corresponding to an aligned solution, we analyze alignment as it relates to the training process. Namely, we characterize when alignment is an invariant of training under gradient descent by providing necessary and sufficient conditions for this invariant to hold. In such settings, the dynamics of gradient descent simplify, thereby allowing us to provide an explicit learning rate under which the network converges linearly to a global minimum. We then analyze networks with layer constraints such as convolutional networks. In this setting, we prove that gradient descent is equivalent to projected gradient descent, and that alignment is impossible with sufficiently large datasets.

1. INTRODUCTION

Although overparameterized deep networks can interpolate randomly labeled training data (Du et al., 2019; Wu et al., 2019), training overparameterized networks with modern optimizers often leads to solutions that generalize well. This suggests that a form of implicit regularization occurs during training (Zhang et al., 2017). As an example of implicit regularization, Ji & Telgarsky (2018) proved that the layers of linear neural networks used for binary classification on linearly separable datasets become aligned in the limit of training. That is, for a linear network parameterized by the matrix product $W_d W_{d-1} \cdots W_1$, the top left/right singular vectors $u_i$ and $v_i$ of layer $W_i$ satisfy $|v_{i+1}^T u_i| \to 1$ as the number of gradient descent steps goes to infinity. Alignment of singular vector spaces between adjacent layers allows the network representation to be drastically simplified (see Equation 3); namely, the product of all layers becomes a product of diagonal matrices with the exception of the outermost unitary matrices. If alignment is an invariant of training, then optimization over the set of weight matrices reduces to optimization over the set of singular values of weight matrices. Thus, importantly, alignment of singular vector spaces allows the gradient descent update rule to be simplified significantly, which was used in Ji & Telgarsky (2018) to show convergence to a max-margin solution. In this work, we generalize the definition of alignment to the multidimensional setting. We study when alignment can occur and, moreover, under which conditions it is an invariant of training in linear neural networks under gradient descent. Prior works (Gidel et al., 2019; Saxe et al., 2014; 2019) have implicitly relied on invariance of alignment as an assumption on initialization to simplify training dynamics for 2-layer networks.
In this work, we provide necessary and sufficient conditions for when alignment is an invariant for networks of arbitrary depth. Our main contributions are as follows:

1. We extend the definition of alignment from the 1-dimensional classification setting to the multi-dimensional setting (Definition 2) and characterize when alignment is an invariant of training in linear fully connected networks with multi-dimensional outputs (Theorem 1).
2. We demonstrate that alignment is an invariant for fully connected networks with multi-dimensional outputs only in special problem classes including autoencoding, matrix factorization and matrix sensing. This is in contrast to networks with 1-dimensional outputs, where there exists an initialization such that adjacent layers remain aligned throughout training under any real-valued loss function and any training dataset.
3. Alignment largely simplifies the analysis of training linear networks: we provide an explicit learning rate under which gradient descent converges linearly to a global minimum under alignment in the squared loss setting (Proposition 1).
4. We prove that alignment cannot occur, let alone be invariant, in networks with constrained layer structure (such as convolutional networks), when the amount of training data dominates the dimension of the layer structure (Theorem 3).
5. We support our theoretical findings via experiments in Section 6.

As a consequence, our characterization of the invariance properties of alignment provides settings under which the gradient descent dynamics can be simplified and the implicit regularization properties can be fully understood, yet also shows that further results are required to explain implicit regularization in linear neural networks more generally.

2. RELATED WORK

Implicit regularization in overparameterized networks has become a subject of significant interest (Gunasekar et al., 2018a; b; Martin & Mahoney, 2018; Neyshabur et al., 2014). In order to characterize the specific form of implicit regularization, several works have focused on analyzing deep linear networks (Arora et al., 2019b; Gunasekar et al., 2018b; 2017; Soudry et al., 2018). Even though such networks can only express linear maps, parameter optimization in linear networks is non-convex and is studied in order to obtain intuition about optimization of deep networks more generally. One such form of implicit regularization is alignment, identified by Ji & Telgarsky (2018) to analyze linear fully connected networks with 1-dimensional outputs trained on linearly separable data. They proved that in the limit of training, each layer, after normalization, approaches a rank 1 matrix, i.e. $\lim_{t \to \infty} W_i^{(t)} / \|W_i^{(t)}\|_F = u_i v_i^T$, and that adjacent layers $W_{i+1}$ and $W_i$ become aligned, i.e. $|v_{i+1}^T u_i| \to 1$. In addition, Ji & Telgarsky (2018) proved that alignment in this setting occurs concurrently with convergence to the max-margin solution. Follow-up work mainly focused on this convergence phenomenon and gave explicit convergence rates for overparameterized networks trained with gradient descent (Arora et al., 2019c; Zou et al., 2018). Our definition of invariance of alignment extends assumptions on initialization appearing in various prior works (Gidel et al., 2019; Saxe et al., 2014; 2019). While the connection to alignment was not mentioned in their work, Gidel et al. (2019) begin to generalize alignment to multidimensional outputs by considering two-layer networks initialized so that layers are aligned with each other and to the data. We generalize this to networks of any depth, showing that our definition of alignment corresponds to the initialization considered in Gidel et al. (2019).
Moreover, we establish necessary and sufficient conditions for when alignment is an invariant of training in Theorem 1 instead of assuming these conditions. Furthermore, their result on sequential learning of components can be derived via our singular value update rule in Corollary 1.

Balancedness is another closely related form of implicit regularization in linear neural networks. It was introduced in Arora et al. (2018) and defined as the property that if $W_i^T W_i = W_{i+1} W_{i+1}^T$ for all $i$ at initialization, then this property is invariant under gradient flow. Du et al. (2018) present a more general form, that $W_i^T W_i - W_{i+1} W_{i+1}^T$ is constant under gradient flow. In practice, analyses rely on this quantity being close to or exactly zero. In this exact setting, balancedness indeed implies alignment of singular vector spaces between consecutive layers. To study gradient descent, slightly more general notions such as approximate balancedness (Arora et al., 2019a) and -balancedness have been introduced. Du et al. (2018) also defined balancedness with respect to convolutional networks, showing that under gradient flow, the difference in the norm of the weights of consecutive layers is an invariant. Generally, the goal of identifying invariants of training such as balancedness or alignment is to help understand both the dynamics of training and properties of solutions at the end of training.

3. DEFINITION OF ALIGNMENT IN THE MULTI-DIMENSIONAL SETTING

In this section, we first define alignment for linear neural networks with multi-dimensional outputs. We then define when alignment is an invariant of training.

We consider linear neural networks. Let $f : \mathbb{R}^{k_0} \to \mathbb{R}^{k_d}$ denote such a $d$-layer network, i.e.

$$f(x) = W_d W_{d-1} \cdots W_1 x, \tag{1}$$

where $W_i \in \mathbb{R}^{k_i \times k_{i-1}}$ for $i \in [d]$, and we follow the convention that $[d] = \{1, 2, \ldots, d\}$. Let $(X, Y) \in \mathbb{R}^{k_0 \times n} \times \mathbb{R}^{k_d \times n}$ denote the set of training data pairs $\{(x^{(i)}, y^{(i)})\}$ for $i \in [n]$. Gradient descent with learning rate $\gamma$ is used to find a solution to the following optimization problem:

$$\arg\min_{f \in F} \frac{1}{2n} \sum_{i=1}^{n} \ell(f(x^{(i)}), y^{(i)}), \tag{2}$$

where $F$ is the set of linear functions represented by $f$ and $\ell$ is a real-valued loss function. When not stated otherwise, we assume $\ell(f(x^{(i)}), y^{(i)}) = \|y^{(i)} - f(x^{(i)})\|_2^2$, which is the squared loss (MSE). In addition, we denote by $W_i^{(t)}$ for $t \in \mathbb{Z}_{\geq 0}$ the weight matrix $W_i$ after $t$ steps of gradient descent. When there are no additional constraints on the matrices $W_i$, then $f$ is a fully connected network. We next introduce a generalized form of the singular value decomposition:

Definition 1. An unsorted, signed singular value decomposition (usSVD) of a matrix $A \in \mathbb{R}^{m \times n}$ is a triple $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$ such that $U, V$ are orthonormal matrices, $\Sigma$ is diagonal, and $A = U \Sigma V^T$.

In contrast to the usual definition of singular value decomposition (SVD) of a matrix, the diagonal entries of $\Sigma$ may be in any order and take negative values. Throughout, we will refer to the entries of $\Sigma$ in a usSVD as singular values and the vectors in $U, V$ as singular vectors. Using the usSVD, we now generalize the notion of alignment from Ji & Telgarsky (2018) to the multi-dimensional setting.

Definition 2. Let $f = W_d W_{d-1} \cdots W_1$ be a linear network. We say that $f$ is aligned if there exists a usSVD $W_i = U_i \Sigma_i V_i^T$ with $U_i = V_{i+1}$ for all $i \in [d-1]$.
(We also say that a matrix $A$ is aligned with another matrix $B$ if there exist usSVDs $A = U_A \Sigma_A V_A^T$, $B = U_B \Sigma_B V_B^T$ such that $V_A = U_B$.) Note that if $W_i$ and $W_{i+1}$ are rank 1 matrices in an aligned network $f$, then the inner product of the first columns of $V_{i+1}$ and $U_i$ is 1 in absolute value. Hence Definition 2 is consistent with alignment in the 1-dimensional setting from Ji & Telgarsky (2018). We next define when alignment is an invariant of training for deep linear networks. Again, such invariants are of interest since they may provide insights into properties of trained networks and significantly simplify the dynamics of gradient descent.

Definition 3. Alignment is an invariant of training for a linear neural network $f$ if there exists an initialization $\{W_j^{(0)}\}_{j=1}^{d}$ such that $W_1^{(\infty)}, W_2^{(\infty)}, \ldots, W_d^{(\infty)}$ achieves zero training error and for all gradient descent steps $t \in \mathbb{Z}_{\geq 0}$
(a) the network $f$ is aligned;
(b) $W_i^{(t)} = U_i \Sigma_i^{(t)} V_i^T$ for all $i \in \{2, \ldots, d-1\}$, that is, $U_i, V_i$ are not updated;
(c) $W_1^{(t)} = U_1 \Sigma_1^{(t)} V_1^{(t)T}$ and $W_d^{(t)} = U_d^{(t)} \Sigma_d^{(t)} V_d^T$, that is, $U_1$ and $V_d$ are not updated.
If additionally, $V_1$ and $U_d$ are not updated for any $t \in \mathbb{Z}_{\geq 0}$, then we say that strong alignment is an invariant of training.

When alignment is an invariant of training, there are important consequences for training. In particular, note that when the network $f$ is aligned with usSVDs $W_i = U_i \Sigma_i V_i^T$ for all $1 \leq i \leq d$, then

$$f(x) = W_d \cdots W_1 x = U_d \left( \prod_{i=0}^{d-1} \Sigma_{d-i} \right) V_1^T x. \tag{3}$$

Hence if alignment is an invariant of training, then the singular vectors of layers 2 through $d-1$ are never updated and the analysis of gradient descent can be limited to the singular values of the layers and the matrices $V_1$ and $U_d$.

Remarks. For the remainder of the paper, we assume that the gradient of the loss function at initialization $\{W_i^{(0)}\}_{i=1}^{d}$ is non-zero. Otherwise, training with gradient descent would not proceed.
We also only consider datasets (X, Y ) for which there is a linear network that achieves loss zero. This is consistent with the assumptions in Ji & Telgarsky (2018) .
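As an illustrative sketch of Definition 2 and Equation (3), the following NumPy snippet builds an aligned network with square layers and checks numerically that the layer product collapses to $U_d (\Sigma_d \cdots \Sigma_1) V_1^T$. The helper names (`random_orthogonal`, `aligned_layers`), the layer sizes, and the random seed are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def random_orthogonal(k, rng):
    """Return a k x k orthogonal matrix via QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return q

def aligned_layers(d, k, rng):
    """Build d square layers W_i = U_i S_i V_i^T with U_i = V_{i+1}.

    bases[0] plays the role of V_1, bases[i] is the shared basis
    U_i = V_{i+1}, and bases[d] is U_d.
    """
    bases = [random_orthogonal(k, rng) for _ in range(d + 1)]
    # usSVD: diagonal entries may be negative and are not sorted.
    sigmas = [np.diag(rng.standard_normal(k)) for _ in range(d)]
    layers = [bases[i + 1] @ sigmas[i] @ bases[i].T for i in range(d)]
    return layers, sigmas, bases

rng = np.random.default_rng(0)
layers, sigmas, bases = aligned_layers(d=4, k=5, rng=rng)

# Full product W_d ... W_1.
product = np.eye(5)
for W in layers:
    product = W @ product

# Collapsed form from Equation (3): U_d (Sigma_d ... Sigma_1) V_1^T.
sigma_total = np.eye(5)
for S in sigmas:
    sigma_total = S @ sigma_total
collapsed = bases[-1] @ sigma_total @ bases[0].T

max_gap = np.max(np.abs(product - collapsed))
print("max entrywise gap:", max_gap)
```

The inner orthogonal factors cancel pairwise because $V_{i+1}^T U_i = I$, which is why only diagonal matrices remain between $U_d$ and $V_1^T$.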

4. ALIGNMENT IN FULLY CONNECTED NETWORKS

In this section, we first characterize when alignment is an invariant of training for fully connected networks (Theorem 1). In particular, we show that this is not the case in general. We then present special classes of problems for which alignment is an invariant of training, namely autoencoding, matrix factorization, and matrix sensing. In contrast, for a linear network with 1-dimensional outputs, we demonstrate that there exists an initialization for which the layers remain aligned throughout training given any dataset and any real-valued loss function. Finally, we discuss various consequences of alignment, including a proof of linear convergence of gradient descent to an interpolating solution.

4.1. CHARACTERIZATION OF ALIGNMENT WITH MULTI-DIMENSIONAL OUTPUTS

Theorem 1 is one of our main results and characterizes when alignment is an invariant of training in a fully connected network with multi-dimensional outputs. To simplify notation, we consider the case when the layers are square matrices, i.e. $k_i = k_j$ for all $0 \leq i, j \leq d$. The general result for non-square matrices is provided in Appendix D.

Theorem 1. Let $f : \mathbb{R}^k \to \mathbb{R}^k$ be a linear fully connected network with $d \geq 3$ square layers of size $k > 1$. Alignment is an invariant of training under the squared loss on a dataset $(X, Y) \in \mathbb{R}^{k \times n} \times \mathbb{R}^{k \times n}$ if and only if there exist orthonormal matrices $U, V \in \mathbb{R}^{k \times k}$ such that $U^T Y X^T V$ and $V^T X X^T V$ are diagonal.

The full proof of this result is presented in Appendices A-E; here, we provide a proof sketch.

Proof Sketch. The proof essentially follows by induction. For the base case, we initialize the layers $\{W_i\}_{i=1}^{d}$ to satisfy the conditions for alignment given in Definition 3. Assuming that these conditions hold at gradient descent step $t$, we prove that they hold at step $t+1$. After substituting the alignment conditions into the gradient descent update equation for the squared loss at step $t+1$ and cancelling terms, we obtain that alignment is an invariant of training if and only if

$$U_d^{(t)T} \sum_{k=1}^{n} \left( y^{(k)} - f(x^{(k)}) \right) x^{(k)T} V_1^{(t)} \tag{4}$$

is a diagonal matrix. By considering the update for $W_1^{(t)}$ and $W_d^{(t)}$, one sees that alignment implies strong alignment and so $U_d, V_1$ are also invariant across updates. Thus, let $U_d = U$ and $V_1 = V$. By expanding $f(x^{(k)})$ using Equation (3), and considering the update across multiple timesteps, we obtain that the matrix in Equation (4) is diagonal if and only if $U^T Y X^T V$ and $V^T X X^T V$ are diagonal. To complete the proof, we show in Appendix D that under strong alignment, gradient descent converges to a solution with zero training error.

Theorem 1 implies that invariance of alignment throughout training holds only for special classes of problems.
In particular, the above implies that alignment is an invariant of training when X and Y have the same right singular vectors, a very special condition on the data. Note that this corresponds to the = 0 data condition with the initialization considered in Gidel et al. (2019) . In Section 6, we also provide empirical support showing that alignment is not an invariant of training for important tasks that violate the data condition presented here, such as multi-class classification.
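The following sketch illustrates Theorem 1 numerically on a small example. It runs plain gradient descent on a 3-layer square linear network from an aligned initialization and tracks how far each $U_i^T W_i V_i$ drifts from being diagonal. When $X$ and $Y$ share right singular vectors the drift should stay at floating-point level, while for a generic $Y$ it should not. All sizes, singular values, the learning rate, the step count, and the helper names are assumptions chosen for this illustration.

```python
import numpy as np

def random_orthogonal(k, rng):
    q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return q

def gd_step(layers, X, Y, lr):
    """One gradient descent step on (1/2n) * ||Y - W_d ... W_1 X||_F^2."""
    k, n = X.shape
    d = len(layers)
    prefix = [np.eye(k)]                 # prefix[j] = W_j ... W_1
    for W in layers:
        prefix.append(W @ prefix[-1])
    suffix = [np.eye(k)]
    for W in reversed(layers):
        suffix.append(suffix[-1] @ W)
    suffix = suffix[::-1]                # suffix[j] = W_d ... W_{j+1}
    residual = Y - prefix[d] @ X
    return [
        W + (lr / n) * suffix[j + 1].T @ residual @ (prefix[j] @ X).T
        for j, W in enumerate(layers)
    ]

def max_offdiag(layers, bases):
    """Largest off-diagonal entry of U_i^T W_i V_i over all layers."""
    worst = 0.0
    for j, W in enumerate(layers):
        M = bases[j + 1].T @ W @ bases[j]
        worst = max(worst, np.max(np.abs(M - np.diag(np.diag(M)))))
    return worst

def run(Y, X, bases, steps=200, lr=0.05):
    k = X.shape[0]
    # Aligned initialization: W_i = U_i (0.5 I) V_i^T with U_i = V_{i+1}.
    layers = [bases[j + 1] @ (0.5 * np.eye(k)) @ bases[j].T
              for j in range(len(bases) - 1)]
    for _ in range(steps):
        layers = gd_step(layers, X, Y, lr)
    return max_offdiag(layers, bases)

rng = np.random.default_rng(0)
k, n, d = 4, 6, 3
bases = [random_orthogonal(k, rng) for _ in range(d + 1)]
V, U = bases[0], bases[-1]
R, _ = np.linalg.qr(rng.standard_normal((n, k)))  # n x k, orthonormal columns

X = V @ np.diag([1.5, 1.2, 0.9, 0.6]) @ R.T
Y_structured = U @ np.diag([1.0, 0.8, 0.6, 0.4]) @ R.T  # same right singular vectors as X
Y_generic = rng.standard_normal((k, n))

gap_structured = run(Y_structured, X, bases)
gap_generic = run(Y_generic, X, bases)
print("off-diagonal drift, structured data:", gap_structured)
print("off-diagonal drift, generic data:   ", gap_generic)
```

In the structured case both $U^T Y X^T V$ and $V^T X X^T V$ are diagonal by construction, so every update stays inside the fixed singular bases.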

4.2. CLASSES OF PROBLEMS WITH ALIGNMENT

We next discuss classes of problems for which alignment is an invariant of training.

Autoencoding. In the case when $X = Y$, it holds that $U^T Y X^T V = U^T X X^T V$. Taking $U = V$ to be the left singular vectors of $X$ satisfies the conditions of Theorem 1.

Matrix Factorization and Inversion. In the case of matrix factorization, we have that $X = I$. Hence taking $U$ and $V$ to be the left and right singular vectors of $Y$ respectively satisfies the conditions of Theorem 1. For matrix inversion, we have that $Y = I$ and we proceed analogously.
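A small numerical check of the two cases above, written as a sketch: `theorem1_gap` (a name introduced here, not taken from the paper) measures the largest off-diagonal entry of $U^T Y X^T V$ and $V^T X X^T V$. For comparison, a generic pair $(X, Y)$ is tested with $U, V$ taken from the SVD of $Y X^T$; this only shows that this particular choice of $U, V$ fails, although for generic data that choice is essentially forced up to signs and ordering. Sizes and seed are arbitrary.

```python
import numpy as np

def offdiag_max(M):
    """Largest absolute off-diagonal entry of a square matrix."""
    return np.max(np.abs(M - np.diag(np.diag(M))))

def theorem1_gap(U, V, X, Y):
    """How far U^T Y X^T V and V^T X X^T V are from being diagonal."""
    return max(offdiag_max(U.T @ Y @ X.T @ V),
               offdiag_max(V.T @ X @ X.T @ V))

rng = np.random.default_rng(1)
k, n = 5, 8

# Autoencoding: X = Y, with U = V = left singular vectors of X.
X = rng.standard_normal((k, n))
U_x, _, _ = np.linalg.svd(X)
gap_autoencoding = theorem1_gap(U_x, U_x, X, X)

# Matrix factorization: X = I, with U, V the singular vectors of Y.
Y_square = rng.standard_normal((k, k))
U_y, _, Vt_y = np.linalg.svd(Y_square)
gap_factorization = theorem1_gap(U_y, Vt_y.T, np.eye(k), Y_square)

# Generic regression data: diagonalizing Y X^T does not diagonalize X X^T.
Y_generic = rng.standard_normal((k, n))
U_g, _, Vt_g = np.linalg.svd(Y_generic @ X.T)
gap_generic = theorem1_gap(U_g, Vt_g.T, X, Y_generic)

print(gap_autoencoding, gap_factorization, gap_generic)
```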

Matrix Sensing. Given pairs of observations $\{(M_i, y_i)\}_{i=1}^{n}$ with $M_i \in \mathbb{R}^{k \times k}$ and $y_i = \mathrm{Tr}(M_i^T X^*)$ for some unobserved matrix $X^* \in \mathbb{R}^{k \times k}$, gradient descent on $\{W_i\}_{i=1}^{d}$ is used to solve

$$\arg\min_{\{W_i\}} \frac{1}{2n} \sum_{i=1}^{n} \left\| y_i - \mathrm{Tr}(M_i^T W_d W_{d-1} \cdots W_1) \right\|_2^2.$$

Implicit regularization in the matrix sensing setting has been analyzed extensively (Arora et al., 2019b; Du et al., 2018; Gunasekar et al., 2017; Li et al., 2018). Theorem 1 shows that alignment is an invariant of training in this setting if and only if $M_i = U \Lambda_i V^T$ for all $i \in [n]$, and $U_d = U$, $V_1 = V$.

1-dimensional Outputs. In Appendix F, we show that alignment is an invariant of training for fully connected networks with 1-dimensional outputs for any real-valued loss function provided that gradient descent converges to zero training error.

4.3. CONSEQUENCES OF ALIGNMENT

We next discuss various consequences of the invariance of alignment for the analysis of training. Our explicit characterization of alignment as an invariant is significant as it allows us to greatly simplify the convergence analysis of gradient descent, which is a main goal of defining an invariant of training. The following corollary (proof in Appendix B) follows from the proof of Theorem 1, and shows that under alignment the gradient descent update rule is simplified significantly.

Corollary 1. Let $r = \min(k_0, k_1, \ldots, k_d) > 1$ and let the top left $r \times r$ submatrix of $U^T Y X^T V$ be $\Lambda$ and that of $V^T X X^T V$ be $\tilde{\Lambda}$. Under the invariance of strong alignment (i.e., when $\Lambda$ and $\tilde{\Lambda}$ are diagonal), we can express the partial derivative with respect to $W_i$ as follows:

$$\frac{\partial L}{\partial W_i} = -\frac{1}{n} U_i \left[ \prod_{j=i+1}^{d} \Sigma_j^T \left( U^T Y X^T V - \Sigma_d \cdots \Sigma_1 V^T X X^T V \right) \prod_{j=1}^{i-1} \Sigma_j^T \right] V_i^T. \tag{5}$$

As a result, gradient descent only updates the first $r$ values of $\Sigma_i$. Let $\hat{\Sigma}_i^{(t)}$ be the top left $r \times r$ submatrix of $\Sigma_i^{(t)}$. The updates are then given by:

$$\hat{\Sigma}_i^{(t+1)} = \hat{\Sigma}_i^{(t)} + \frac{\gamma}{n} \prod_{j=1,\, j \neq i}^{d} \hat{\Sigma}_j^{(t)} \left( \Lambda - \prod_{j=1}^{d} \hat{\Sigma}_j^{(t)} \tilde{\Lambda} \right). \tag{6}$$

The other entries of $\Sigma_i^{(t+1)}$ are not updated.

We can use this corollary to provide an explicit learning rate under which gradient descent converges linearly to a global minimum. The proof of the following proposition is given in Appendix C.

Proposition 1. For $k \in [r]$, let $\sigma_k(W_i)$ denote the $k$th entry of $\Sigma_i$ in the usSVD of $W_i$, and let $\lambda_k, \tilde{\lambda}_k$ denote the $k$th entries of $\Lambda, \tilde{\Lambda}$ respectively. Under the conditions of Corollary 1 and assuming that $\sigma_k(W_i^{(0)}) > 0$ and $\prod_{i=1}^{d} \sigma_k(W_i^{(0)}) < \lambda_k / \tilde{\lambda}_k$ for all $k \in [r]$, if the learning rate satisfies γ ≤ n ln 2 d • min k σ k (W (0) i ) 2 λ k λ 2 k then gradient descent only updates the top $r$ singular values of the solution and converges linearly to the global minimum.

Remarks. Shamir (2018) shows that for linear neural networks with one-dimensional outputs, the rate of convergence can be exponentially slow in the depth.
Proposition 1 shows that for a fixed depth, gradient descent converges linearly. Our analysis in Appendix C shows that our upper bound on the rate of convergence can also grow exponentially in the depth.

Alignment in the Limit of Training. We briefly comment on understanding whether alignment will occur in the limit of training. The following proposition states that, for a 2-layer network, an aligned solution achieves the minimum $\ell_2$-norm. The proof is given in Appendix G.

Proposition 2. Let $W_1, W_2$ be matrices such that $W_2 W_1 = P$, for a fixed matrix $P$. Then, $\|W_1\|_F^2 + \|W_2\|_F^2$ achieves a minimum at the solution where $W_1$ and $W_2$ are aligned and 0-balanced, i.e. there exist usSVDs $W_1 = W \Sigma V^T$, $W_2 = U \Sigma W^T$.

It has been shown that SGD in the overparameterized setting for a network initialized close to zero will converge to a solution close in $\ell_2$-norm to the minimum $\ell_2$-norm solution (Azizan et al., 2019). Therefore we expect such networks to converge to a solution which is close to an aligned solution.
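Proposition 2 can be sanity-checked numerically. The sketch below builds the aligned, 0-balanced factorization $W_1 = \Sigma^{1/2} V^T$, $W_2 = U \Sigma^{1/2}$ from the SVD $P = U \Sigma V^T$ (taking $W = I$ in the proposition's notation), then compares its cost with other exact factorizations $W_2 G$, $G^{-1} W_1$ for randomly drawn invertible $G$. The sampling scheme for $G$, the matrix size, and the seed are assumptions of this example, and random sampling only probes the claim rather than proving it.

```python
import numpy as np

def cost(W1, W2):
    """Sum of squared Frobenius norms of the two layers."""
    return np.sum(W1 ** 2) + np.sum(W2 ** 2)

rng = np.random.default_rng(2)
k = 4
P = rng.standard_normal((k, k))

# Aligned, balanced factorization from the SVD of P.
U, s, Vt = np.linalg.svd(P)
root = np.diag(np.sqrt(s))
W1_aligned = root @ Vt
W2_aligned = U @ root
aligned_cost = cost(W1_aligned, W2_aligned)

# Other factorizations of the same P: (W2 G)(G^{-1} W1) = P.
other_costs = []
max_reconstruction_error = 0.0
for _ in range(100):
    G = np.eye(k) + 0.3 * rng.standard_normal((k, k))  # invertible with probability 1
    W1 = np.linalg.solve(G, W1_aligned)                # G^{-1} W1
    W2 = W2_aligned @ G
    max_reconstruction_error = max(max_reconstruction_error,
                                   np.max(np.abs(W2 @ W1 - P)))
    other_costs.append(cost(W1, W2))

print("aligned cost:", aligned_cost, "= 2 * sum of singular values:", 2 * s.sum())
print("smallest sampled alternative:", min(other_costs))
```

The aligned cost equals twice the sum of the singular values of $P$, and no sampled alternative falls below it.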

5. ALIGNMENT UNDER GENERAL LAYER STRUCTURE

In the previous section, we analyzed fully connected networks, where parameters of each weight matrix are optimized independently. The most commonly used deep learning models, however, rely on convolutional layers or layers with other forms of constraints. In this section, we analyze alignment in the setting of linear networks with layer constraints. In particular, we show that when the dimension of the subspace induced by the layer constraints is small compared to the number of training samples, alignment cannot happen, let alone be an invariant of training.

5.1. LINEAR NEURAL NETWORKS WITH LAYER STRUCTURE

We start by setting up mathematical terminology to describe different layer structures.

Definition 4. Let $S \subset \mathbb{R}^{m \times n}$ be a linear subspace of matrices and let $\{A_j\}_{j=1}^{r}$ be an orthogonal basis for $S$. Layer $W_i$ has layer structure $S$ if $W_i \in S$, i.e., there exist coefficients $\{c_j^i\}_{j=1}^{r} \subset \mathbb{R}$ such that $W_i = \sum_{j=1}^{r} c_j^i A_j$, and gradient descent operates on the coefficients $c_j^i$ for $i \in [d]$, $j \in [r]$.

Definition 4 encompasses layer structures commonly used in practice, such as:

Convolutional layers: Treating a $p \times p$ image as a vector in $\mathbb{R}^{p^2}$, a single $s \times s$ convolutional filter with stride 1 and padding $(s-1)/2$ maps the image to another $p \times p$ image; this linear transformation is a matrix in $\mathbb{R}^{p^2 \times p^2}$ and the set of all such transformations forms an $s^2$-dimensional subspace. The parameters of the filter are coefficients of an orthogonal basis of this subspace; see Appendix I.

Layers with Sparse Connections: Consider a fixed connection pattern between layers such that the $j$th hidden unit in layer $i$ depends only on a subset of units in layer $i-1$. In this case, the subspace $S$ consists of matrices where particular entries are forced to be zero corresponding to missing connections between features in consecutive layers.

The following theorem provides, in closed form, the gradient descent update rules for linear networks with layer structure. The proof is provided in Appendix H.

Theorem 2. Performing gradient descent on the basis coefficients $\{c_j^i\}_{j=1}^{r}$ leads to the following weight matrix updates:

$$W_i^{(t+1)} = W_i^{(t)} - \eta \cdot \pi_S\!\left( \frac{\partial \ell}{\partial W_i^{(t)}} \right),$$

where $\pi_S$ denotes the projection operator onto $S$.

Theorem 2 shows that gradient descent in networks with layer structure is equivalent to projected gradient descent. Hence alignment is an invariant of training if and only if it holds throughout the projected gradient descent updates and leads to an aligned solution with zero training loss.
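The equivalence in Theorem 2 can be illustrated on a single Toeplitz-structured layer trained with squared loss. In the sketch below, one gradient step is taken on the basis coefficients, with the coefficient gradient estimated independently by finite differences, and the resulting weights are compared with one projected gradient step on the weight matrix. The example assumes a basis normalized to unit Frobenius norm, uses one layer rather than a deep network, and all function names, sizes, and step sizes are choices made for this illustration.

```python
import numpy as np

def toeplitz_basis(k):
    """Orthonormal (Frobenius) basis of k x k Toeplitz matrices: one per diagonal."""
    basis = []
    for offset in range(-(k - 1), k):
        A = np.eye(k, k=offset)
        basis.append(A / np.linalg.norm(A))
    return basis

def weights(c, basis):
    return sum(cj * A for cj, A in zip(c, basis))

def loss(c, basis, X, Y):
    W = weights(c, basis)
    return np.sum((Y - W @ X) ** 2) / (2 * X.shape[1])

def numeric_coef_grad(c, basis, X, Y, eps=1e-6):
    """Central finite-difference gradient of the loss w.r.t. the coefficients."""
    g = np.zeros_like(c)
    for j in range(len(c)):
        step = np.zeros_like(c)
        step[j] = eps
        g[j] = (loss(c + step, basis, X, Y) - loss(c - step, basis, X, Y)) / (2 * eps)
    return g

def project(G, basis):
    """Projection onto span(basis), valid because the basis is orthonormal."""
    return sum(np.sum(G * A) * A for A in basis)

rng = np.random.default_rng(3)
k, n, eta = 4, 6, 0.1
basis = toeplitz_basis(k)              # r = 2k - 1 = 7 basis matrices
c = rng.standard_normal(len(basis))
X = rng.standard_normal((k, n))
Y = rng.standard_normal((k, n))

# Gradient descent on the coefficients.
W_from_coefficients = weights(c - eta * numeric_coef_grad(c, basis, X, Y), basis)

# Projected gradient descent on the weight matrix.
W = weights(c, basis)
full_grad = -(Y - W @ X) @ X.T / n
W_projected = W - eta * project(full_grad, basis)

update_gap = np.max(np.abs(W_from_coefficients - W_projected))
print("gap between the two updates:", update_gap)
```

If the basis were orthogonal but not normalized, the coefficient update would rescale each direction by the squared norm of its basis element, so the normalization assumption matters for this comparison.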

5.2. NECESSARY CONDITION FOR ALIGNMENT

Motivated by the above characterization via projected gradient descent, we now show that for layer structures with constrained dimension, aligned networks generally cannot achieve zero training error under the squared loss, given sufficient data (Proposition 4). This is the case even when there is a solution with the desired layer structure that achieves zero training error. Hence, if loss is minimized to zero, gradient descent must lead to a non-aligned network. We first show that for an aligned network which interpolates the data, the first and last layer must align with the pseudoinverse. The proof of this result is presented in Appendix J.

Proposition 3. Let $(X, Y) \in \mathbb{R}^{k_0 \times n} \times \mathbb{R}^{k_d \times n}$ such that $n \geq k_0$ and $X$ is full-rank (ensuring that $X X^T$ is invertible). If an aligned network $f = W_d W_{d-1} \cdots W_1$ achieves zero error under squared loss (i.e. if $Y = f(X)$), then $W_d^T$ aligns with $Y X^T (X X^T)^{-1}$, which in turn aligns with $W_1^T$.

The following result tells us that when a linear space $S$ of matrices is sufficiently low-dimensional, the set of matrices that align with an element of $S$ has measure zero. While we are mainly interested in the setting where $n \geq k$, we state it in full generality using $\binom{m}{2} = 0$ when $m < 2$.

Proposition 4. Let $S$ be an $r$-dimensional linear subspace of $k \times k$ matrices. If $r < k - 1 - \binom{k-n}{2}$, then the set of matrices of size $k \times n$ that can align with an element of $S$, excluding scalar multiples of the identity, has Lebesgue measure zero.

The proof of Proposition 4 is provided in Appendix K. Taken together, Propositions 3 and 4 imply Theorem 3, which states that alignment does not occur in linear networks with constrained layer structures given enough training samples. We assume $k = k_0 = \cdots = k_d$ and that all layers have the same structure, $S$. The statement can trivially be extended to the general setting.

Theorem 3. Let $n \geq k$, let $X, Y \in \mathbb{R}^{k \times n}$ be generic, let $S \subset \mathbb{R}^{k \times k}$ be a linear subspace of dimension $r < k - 1$, and let $W_1, \ldots, W_d \in S$ such that at least one $W_i$ is not a scalar multiple of the identity. If the network $f = W_d \cdots W_1$ satisfies $Y = f(X)$, then $f$ is not aligned.

Theorem 3 is in contrast to fully connected networks (i.e., no layer constraints), where we showed that alignment is possible for particular classes of problems including autoencoders. An explicit example of a convolutional linear autoencoder, where alignment is ruled out by Theorem 3, is discussed next.

Example. If $m \geq 4$, then a generic dataset consisting of $n \geq m^2$ images of size $m \times m$ cannot be aligned by any convolutional linear autoencoder with filter size 3, aside from the trivial case where all layers are scalar multiples of the identity. This follows from letting $k = m^2$, $r = 9$ in Proposition 4.
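The arithmetic behind the example can be checked directly. The helper below (a hypothetical name, written only for this illustration) evaluates the dimension condition $r < k - 1 - \binom{k-n}{2}$ from Proposition 4, using the stated convention that the binomial term is zero when $k - n < 2$.

```python
from math import comb

def alignment_ruled_out(k, n, r):
    """Check the dimension condition r < k - 1 - C(k - n, 2) of Proposition 4.

    C(m, 2) is taken to be 0 when m < 2, as in the statement of the proposition.
    """
    excess = k - n
    binom = comb(excess, 2) if excess >= 2 else 0
    return r < k - 1 - binom

# 3 x 3 filters give r = 9. For m x m images, k = m^2 and n >= k.
for m in (3, 4, 28):
    k = m * m
    print(f"m = {m}: k = {k}, condition holds: {alignment_ruled_out(k, n=k, r=9)}")
```

For $m = 3$ the condition reads $9 < 8$ and fails, while for $m = 4$ it reads $9 < 15$ and holds, which matches the requirement $m \geq 4$ in the example.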

6. EMPIRICAL SUPPORT

In this section, we provide experimental results to validate our findings. We measure two properties: (1) invariance of alignment from initialization, and (2) alignment between layers. Invariance of alignment at time $t$ is measured by the average dot product between corresponding columns of $U_i^{(t)}$ and $U_i^{(0)}$, as well as $V_i^{(t)}$ and $V_i^{(0)}$. Alignment is measured by the average dot product between corresponding columns of $U_i^{(t)}$ and $V_{i+1}^{(t)}$. For both, a value of 1 is perfect alignment / invariance.

We begin by demonstrating that alignment is not an invariant of training for fully connected networks when the data conditions of Theorem 1 are violated. Figure 1a shows an example where alignment is not an invariant for multi-dimensional regression with random data under squared loss. We used standard normal inputs $X \in \mathbb{R}^{9 \times 9}$ and targets $Y \in \mathbb{R}^{9 \times 9}$, and a 2-hidden layer network initialized so that alignment holds at the start of training. Since $X$ and $Y$ do not have the same right singular vectors, the conditions of Theorem 1 are violated, and hence alignment is not an invariant of training. In Figures 1b and 1c, we show that alignment is also not an invariant in standard classification settings. We trained a 2-hidden layer fully connected network to classify a linearly separable subset of 256 MNIST examples under MSE loss and cross entropy loss. Figure 1b is consistent with the generalization of Theorem 1 (Appendix D). Interestingly, this result empirically transfers to the case of cross entropy loss, suggesting that our theoretical results may also be relevant for other loss functions.

In networks with constrained layer structure, Theorem 3 shows that given sufficient data alignment cannot occur. We now present empirical evidence that alignment is not an invariant of training, even when the number of training samples is much smaller than the output dimension of the network or the dimensionality of the layer structure is much larger than that of the output.
In the setting of matrix factorization ($Y \in \mathbb{R}^{k \times k}$, $X = I$), $k = n$, so Theorem 3 states that alignment is impossible when the linear structure has dimension $r < k - 1$. In Figure 2a, we observe that alignment is also not an invariant even when $r \geq k - 1$, by training a 2-hidden layer Toeplitz network to factorize a $4 \times 4$ matrix. Our network has 4 hidden units per layer and thus $r = 7$, $k = 4$, $n = 4$. Even when $n < r < k$, we observe that alignment is not an invariant. In Figure 2b, we show that alignment is not an invariant of training when autoencoding a single MNIST example using a 2-hidden layer linear convolutional network (i.e. $n = 1$, $r = 9$, $k = 784$). In Appendix L, we provide empirical validation that alignment is indeed an invariant of training when the data conditions of Theorem 1 are satisfied.
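One possible way to compute the alignment metric between two adjacent layers is sketched below. It uses the standard sorted SVD from NumPy and takes absolute values of the column dot products to remove the sign ambiguity of singular vectors. Both choices are assumptions of this sketch rather than details stated in the text, and a sorted SVD can miss alignment when an aligned usSVD orders the singular values differently in the two layers. The function name and the test matrices are illustrative.

```python
import numpy as np

def random_orthogonal(k, rng):
    q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return q

def alignment_score(W_lower, W_upper, rank=None):
    """Mean |<u_j, v_j>| between left singular vectors of W_lower (layer i)
    and right singular vectors of W_upper (layer i + 1)."""
    U_lower, _, _ = np.linalg.svd(W_lower)
    _, _, Vt_upper = np.linalg.svd(W_upper)
    r = rank if rank is not None else min(U_lower.shape[1], Vt_upper.shape[0])
    dots = np.abs(np.sum(U_lower[:, :r] * Vt_upper[:r, :].T, axis=0))
    return float(np.mean(dots))

rng = np.random.default_rng(4)
k = 4
Q0, Q1, Q2, Q3 = (random_orthogonal(k, rng) for _ in range(4))

# Aligned pair: the left basis of W1 equals the right basis of W2.
# Distinct, decreasing singular values keep the sorted SVD unambiguous.
W1 = Q1 @ np.diag([4.0, 3.0, 2.0, 1.0]) @ Q0.T
W2 = Q2 @ np.diag([5.0, 4.0, 2.0, 1.0]) @ Q1.T
score_aligned = alignment_score(W1, W2)

# Unaligned pair: the right basis of the upper layer is an unrelated rotation.
W2_unaligned = Q2 @ np.diag([5.0, 4.0, 2.0, 1.0]) @ Q3.T
score_unaligned = alignment_score(W1, W2_unaligned)

print("aligned:", score_aligned, "unaligned:", score_unaligned)
```

The same function applied to a layer at step $t$ and at step 0 (comparing left vectors with left vectors) would give an invariance score in the same spirit.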

7. DISCUSSION

We generalized the definition of alignment to linear networks with multi-dimensional outputs. We then analyzed the invariance properties of alignment, showing that under particular data conditions alignment is an invariant for fully connected networks, which allows us to significantly simplify the convergence analysis of gradient descent. We then extended our analysis of alignment to networks with constrained layer structures, such as convolutions, and proved that alignment cannot be an invariant of training in such networks when the dimension of the layer structure r is small compared to the number of training samples n. While the simplification of gradient descent convergence analysis in the fully connected setting shows that our alignment definition is useful in understanding such networks, the fact that it does not generalize as an invariant to the constrained layer structure setting suggests that other approaches may be necessary to fully understand implicit regularization, such as studying how architecture influences the function classes that can be represented by deep networks (Savarese et al., 2019; Zhang et al., 2020; Radhakrishnan et al., 2019) .

APPENDIX

A OUTLINE OF PROOF FOR THEOREM 1, COROLLARY 1, AND PROPOSITION 1

We now provide an outline of our results and proofs.

1. In Appendix A, we introduce Lemmas 1, 2, which will be used to prove Theorem 1.
2. In Appendix B, we provide the proof of Corollary 1 (the simplification of gradient descent under alignment), which relies on Lemma 2.
3. In Appendix C, we provide the proof of Proposition 1 (linear convergence under strong alignment), which relies on Corollary 1.
4. In Appendix D, we introduce Theorem 4, which is a generalization of Theorem 1 to fully connected networks with rectangular layers. We use Lemma 2 and Proposition 1 to prove Theorem 4.
5. In Appendix E, we finally prove Theorem 1, which follows from Theorem 4.

Here, we present two lemmas that will be used extensively in our proofs. Clearly strong alignment being an invariant implies that alignment is an invariant. Now we show that alignment implies strong alignment in the case of networks with square matrix layers.

Lemma 1. Let $\{W_i\}_{i=1}^{d} \subset \mathbb{R}^{k \times k}$, where $d \geq 3$. If alignment is an invariant of training under the squared loss for network $f = W_d W_{d-1} \cdots W_1$ on data $(X, Y) \in \mathbb{R}^{k \times n} \times \mathbb{R}^{k \times n}$, then strong alignment is also invariant.

Proof. Assume that alignment is an invariant of training. Gradient descent on the objective

$$\arg\min_{f \in F} \frac{1}{2n} \sum_{i=1}^{n} \left\| y^{(i)} - f(x^{(i)}) \right\|_2^2 \tag{7}$$

proceeds via the following update rule:

$$W_i^{(t+1)} = W_i^{(t)} + \frac{\gamma}{n} \left( W_d^{(t)} \cdots W_{i+1}^{(t)} \right)^T \sum_{l=1}^{n} \left( y^{(l)} - f(x^{(l)}) \right) \left( W_{i-1}^{(t)} \cdots W_1^{(t)} x^{(l)} \right)^T, \quad \forall i \in [d]. \tag{8}$$

Since alignment is an invariant, the layers satisfy $W_i^{(t)} = U_i \Sigma_i^{(t)} V_i^T$ for $2 \leq i \leq d-1$, $W_1^{(t)} = U_1 \Sigma_1^{(t)} V_1^{(t)T}$, and $W_d^{(t)} = U_d^{(t)} \Sigma_d^{(t)} V_d^T$, where $U_i = V_{i+1}$ for $i \in [d-1]$. For $2 \leq i \leq d-1$, substituting into Equation (8) yields

$$\begin{aligned}
W_i^{(t+1)} &= U_i \Sigma_i^{(t)} V_i^T + \frac{\gamma}{n} \left( U_d^{(t)} \Sigma_d^{(t)} \cdots \Sigma_{i+1}^{(t)} V_{i+1}^T \right)^T \sum_{l=1}^{n} \left( y^{(l)} - f(x^{(l)}) \right) \left( U_{i-1} \Sigma_{i-1}^{(t)} \cdots \Sigma_1^{(t)} V_1^{(t)T} x^{(l)} \right)^T \\
&= U_i \left[ \Sigma_i^{(t)} + \frac{\gamma}{n} \prod_{j=i+1}^{d} \Sigma_j^{(t)T} \, U_d^{(t)T} \sum_{l=1}^{n} \left( y^{(l)} - f(x^{(l)}) \right) x^{(l)T} V_1^{(t)} \prod_{j=1}^{i-1} \Sigma_j^{(t)T} \right] V_i^T \\
&= U_i \left[ \Sigma_i^{(t)} + \frac{\gamma}{n} \prod_{j=i+1}^{d} \Sigma_j^{(t)T} \left( U_d^{(t)T} Y X^T V_1^{(t)} - \Sigma_d^{(t)} \cdots \Sigma_1^{(t)} V_1^{(t)T} X X^T V_1^{(t)} \right) \prod_{j=1}^{i-1} \Sigma_j^{(t)T} \right] V_i^T.
\end{aligned}$$

Since alignment is an invariant, the quantity

$$\prod_{j=i+1}^{d} \Sigma_j^{(t)T} \left( U_d^{(t)T} Y X^T V_1^{(t)} - \Sigma_d^{(t)} \cdots \Sigma_1^{(t)} V_1^{(t)T} X X^T V_1^{(t)} \right) \prod_{j=1}^{i-1} \Sigma_j^{(t)T}$$

is a diagonal matrix for all $t$. Since each of the $\Sigma_j$ are square, full rank matrices, the quantity $U_d^{(t)T} Y X^T V_1^{(t)} - \Sigma_d^{(t)} \cdots \Sigma_1^{(t)} V_1^{(t)T} X X^T V_1^{(t)}$ must be diagonal for all $t$.
The update rule for $W_1$ is given by

$$\begin{aligned}
W_1^{(t+1)} &= W_1^{(t)} + \frac{\gamma}{n} \left( W_d^{(t)} \cdots W_2^{(t)} \right)^T \sum_{l=1}^{n} \left( y^{(l)} - f(x^{(l)}) \right) {x^{(l)}}^T \\
U_1 \Sigma_1^{(t+1)} {V_1^{(t+1)}}^T &= U_1 \Sigma_1^{(t)} {V_1^{(t)}}^T + \frac{\gamma}{n} V_2 \prod_{j=2}^{d} {\Sigma_j^{(t)}}^T \, {U_d^{(t)}}^T \left( Y X^T - U_d^{(t)} \Sigma_d^{(t)} \cdots \Sigma_1^{(t)} {V_1^{(t)}}^T X X^T \right) \\
\implies \Sigma_1^{(t+1)} {V_1^{(t+1)}}^T V_1^{(t)} &= \Sigma_1^{(t)} + \frac{\gamma}{n} \prod_{j=2}^{d} {\Sigma_j^{(t)}}^T \left( {U_d^{(t)}}^T Y X^T V_1^{(t)} - \Sigma_d^{(t)} \cdots \Sigma_1^{(t)} {V_1^{(t)}}^T X X^T V_1^{(t)} \right),
\end{aligned}$$

which is diagonal. Therefore ${V_1^{(t+1)}}^T V_1^{(t)}$ is diagonal, and since it is also an orthogonal matrix, we must have $V_1^{(t+1)} = V_1^{(t)}$. Similarly, the update rule for $W_d$ is given by

$$\begin{aligned}
W_d^{(t+1)} &= W_d^{(t)} + \frac{\gamma}{n} \sum_{l=1}^{n} \left( y^{(l)} - f(x^{(l)}) \right) {x^{(l)}}^T \left( W_{d-1}^{(t)} \cdots W_1^{(t)} \right)^T \\
U_d^{(t+1)} \Sigma_d^{(t+1)} V_d^T &= U_d^{(t)} \Sigma_d^{(t)} V_d^T + \frac{\gamma}{n} \left( Y X^T - U_d^{(t)} \Sigma_d^{(t)} \cdots \Sigma_1^{(t)} {V_1^{(t)}}^T X X^T \right) V_1^{(t)} \prod_{j=1}^{d-1} {\Sigma_j^{(t)}}^T \, U_{d-1}^T \\
\implies {U_d^{(t)}}^T U_d^{(t+1)} \Sigma_d^{(t+1)} &= \Sigma_d^{(t)} + \frac{\gamma}{n} \left( {U_d^{(t)}}^T Y X^T V_1^{(t)} - \Sigma_d^{(t)} \cdots \Sigma_1^{(t)} {V_1^{(t)}}^T X X^T V_1^{(t)} \right) \prod_{j=1}^{d-1} {\Sigma_j^{(t)}}^T ,
\end{aligned}$$

where the last line uses $U_{d-1} = V_d$. This expression is diagonal. Therefore ${U_d^{(t)}}^T U_d^{(t+1)}$ is also diagonal, implying that $U_d^{(t)} = U_d^{(t+1)}$. Therefore strong alignment is also an invariant. This means that alignment being an invariant and strong alignment being an invariant are equivalent in the setting where all the $k_i$ are equal.

Having shown this equivalence in the setting where all the layers are square, we now prove the following lemma for the general case where the $k_i$ are not necessarily all equal.

Lemma 2. Let $f : \mathbb{R}^{k_0} \to \mathbb{R}^{k_d}$ be a linear fully connected network as in Equation 1, and let $r = \min(k_0, \ldots, k_d)$. For training under the squared loss on the dataset $(X, Y)$, there exists an aligned initialization $f(x) = W_d^{(0)} \cdots W_1^{(0)} x$ such that $W_i^{(t)} = U_i \Sigma_i^{(t)} V_i^T$ for all $i \in [d]$ (that is, $U_i, V_i$ are not updated) if and only if there exist orthonormal matrices $U \in \mathbb{R}^{k_d \times k_d}$, $V \in \mathbb{R}^{k_0 \times k_0}$ such that

$$U^T Y X^T V = \begin{bmatrix} \Lambda' & 0 \\ 0 & A_1 \end{bmatrix}, \quad \text{and} \quad V^T X X^T V = \begin{bmatrix} \Lambda & 0 \\ 0 & A_2 \end{bmatrix}$$

for diagonal $r \times r$ matrices $\Lambda, \Lambda'$ and arbitrary $A_1 \in \mathbb{R}^{(k_d - r) \times (k_0 - r)}$, $A_2 \in \mathbb{R}^{(k_0 - r) \times (k_0 - r)}$.

Proof. Gradient descent on the objective

$$\arg\min_{f \in \mathcal{F}} \frac{1}{2n} \sum_{i=1}^{n} \left\| y^{(i)} - f(x^{(i)}) \right\|_2^2$$

proceeds via the following update rule:

$$W_i^{(t+1)} = W_i^{(t)} + \frac{\gamma}{n} \left( W_d^{(t)} \cdots W_{i+1}^{(t)} \right)^T \sum_{l=1}^{n} \left( y^{(l)} - f(x^{(l)}) \right) \left( W_{i-1}^{(t)} \cdots W_1^{(t)} x^{(l)} \right)^T, \quad \forall i \in [d], \tag{10}$$

where $\gamma$ is the learning rate and the superscript $(t)$ denotes the gradient descent step. Assume that the network is initialized to be aligned, that is, there exist orthonormal $U_i, V_i$ and diagonal matrices $\Sigma_i$ such that $W_i = U_i \Sigma_i V_i^T$ and $U_i = V_{i+1}$ for $i \in [d-1]$. Substituting into Equation (10) yields

$$\begin{aligned}
W_i^{(t+1)} &= U_i \Sigma_i^{(t)} V_i^T + \frac{\gamma}{n} \left( U_d \Sigma_d^{(t)} \cdots \Sigma_{i+1}^{(t)} V_{i+1}^T \right)^T \sum_{l=1}^{n} \left( y^{(l)} - f(x^{(l)}) \right) \left( U_{i-1} \Sigma_{i-1}^{(t)} \cdots \Sigma_1^{(t)} V_1^T x^{(l)} \right)^T \\
&= U_i \left[ \Sigma_i^{(t)} + \frac{\gamma}{n} \prod_{j=i+1}^{d} {\Sigma_j^{(t)}}^T \, U_d^T \sum_{l=1}^{n} \left( y^{(l)} - f(x^{(l)}) \right) {x^{(l)}}^T V_1 \prod_{j=1}^{i-1} {\Sigma_j^{(t)}}^T \right] V_i^T \\
&= U_i \left[ \Sigma_i^{(t)} + \frac{\gamma}{n} \prod_{j=i+1}^{d} {\Sigma_j^{(t)}}^T \left( U_d^T Y X^T V_1 - \Sigma_d^{(t)} \cdots \Sigma_1^{(t)} V_1^T X X^T V_1 \right) \prod_{j=1}^{i-1} {\Sigma_j^{(t)}}^T \right] V_i^T .
\end{aligned}$$

Thus strong alignment is an invariant if and only if, for all $i$, the quantity

$$\prod_{j=i+1}^{d} {\Sigma_j^{(t)}}^T \left( U_d^T Y X^T V_1 - \Sigma_d^{(t)} \cdots \Sigma_1^{(t)} V_1^T X X^T V_1 \right) \prod_{j=1}^{i-1} {\Sigma_j^{(t)}}^T$$

is a $k_i \times k_{i-1}$ diagonal matrix for all $t$. At initialization each of the $\Sigma_j$ has rank at least $r$. Considering $i = 1$ and $i = d$, the above quantity is diagonal if and only if the matrix

$$U_d^T Y X^T V_1 - \Sigma_d^{(t)} \cdots \Sigma_1^{(t)} V_1^T X X^T V_1 \tag{11}$$

has its top $r$ rows and top $r$ columns all diagonal; i.e., we can write this expression as

$$\begin{bmatrix} D & 0 \\ 0 & A \end{bmatrix} \tag{12}$$

for an $r \times r$ diagonal matrix $D$ and an arbitrary $(k_d - r) \times (k_0 - r)$ matrix $A$.

For the first direction, assume that strong alignment is an invariant, i.e., that Equation (11) can be written in the above block diagonal form. Define $\Sigma_{\mathrm{tot}}^{(t)} = \Sigma_d^{(t)} \cdots \Sigma_1^{(t)}$; this is a diagonal matrix whose only nonzero entries are the first $r$ on the diagonal. We know that the quantity $U_d^T Y X^T V_1 - \Sigma_{\mathrm{tot}}^{(t)} V_1^T X X^T V_1$ is of the form of Equation (12) for all gradient descent steps $t$, and thus $\left( \Sigma_{\mathrm{tot}}^{(t)} - \Sigma_{\mathrm{tot}}^{(0)} \right) V_1^T X X^T V_1$ is of this form as well. Assuming that we have not initialized any of the singular values to be their optimal value (which is satisfied with probability 1), the top $r$ diagonal entries of $\Sigma_{\mathrm{tot}}^{(t)} - \Sigma_{\mathrm{tot}}^{(0)}$ are nonzero, which means that the top left $r \times r$ submatrix of $V_1^T X X^T V_1$ is diagonal, and that the top right submatrix consists of all zeros. But since $V_1^T X X^T V_1$ is symmetric, the bottom left submatrix must also consist of all zeros, and thus we have

$$V_1^T X X^T V_1 = \begin{bmatrix} D_2 & 0 \\ 0 & A_2 \end{bmatrix}$$

for an $r \times r$ diagonal matrix $D_2$ and an arbitrary $(k_0 - r) \times (k_0 - r)$ matrix $A_2$. Plugging this into Equation (11) implies that $U_d^T Y X^T V_1$ must be of this form as well.

We next show the other direction. Assume that for some orthonormal matrices $U$ and $V$, it holds that $V^T X X^T V$ is diagonal and $U^T Y X^T V$ can be written in the block matrix form given by Equation (12). Initializing the layers such that $U_d = U$, $V_1 = V$, and $U_i = V_{i+1}$ for $i \in [d-1]$ implies that Equation (11) is also of this block diagonal form, as desired.

B PROOF OF COROLLARY 1

Proof. The conditions of strong alignment imply the conditions of Lemma 2, which in turn implies that there exist orthonormal matrices $U, V$ such that

$$U^T Y X^T V = \begin{bmatrix} \Lambda' & 0 \\ 0 & A_1 \end{bmatrix}, \qquad V^T X X^T V = \begin{bmatrix} \Lambda & 0 \\ 0 & A_2 \end{bmatrix},$$

where $\Lambda, \Lambda'$ are $r \times r$ diagonal matrices. Furthermore, from the proof of Theorem 1, if the layers are initialized to be aligned, with $U_d = U$ and $V_1 = V$, then the gradient descent updates are as follows:

$$W_i^{(t+1)} = U_i \left[ \Sigma_i^{(t)} + \frac{\gamma}{n} \prod_{j=i+1}^{d} {\Sigma_j^{(t)}}^T \left( U^T Y X^T V - \Sigma_d^{(t)} \cdots \Sigma_1^{(t)} V^T X X^T V \right) \prod_{j=1}^{i-1} {\Sigma_j^{(t)}}^T \right] V_i^T .$$

Since the minimum of the ranks of the $\Sigma_i$ is $r$, only the top $r$ singular values of $W_i$ are updated. Plugging in the expressions for $U^T Y X^T V$ and $V^T X X^T V$ and restricting to the top $r$ singular values (which we continue to denote by $\Sigma_i$), we obtain the statement of Corollary 1, with the singular values of each layer being updated as

$$\Sigma_i^{(t+1)} = \Sigma_i^{(t)} + \frac{\gamma}{n} \prod_{j \neq i} \Sigma_j^{(t)} \left( \Lambda' - \prod_{j=1}^{d} \Sigma_j^{(t)} \Lambda \right).$$

This completes the proof.
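The reduction to singular values can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: it assumes a three-layer network with $4 \times 4$ layers, builds data for which $U^T Y X^T V$ and $V^T X X^T V$ are diagonal, runs full matrix gradient descent from an aligned initialization, and compares the result with the entrywise singular-value recursion. The helper names (`random_orthogonal`, `gd_step`, `offdiag_norm`), the sizes, the learning rate, and the random seed are all illustrative assumptions.

```python
import numpy as np


def random_orthogonal(k, rng):
    """Return a random k x k orthogonal matrix (Q factor of a Gaussian matrix)."""
    q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return q


def gd_step(W1, W2, W3, X, Y, lr):
    """One gradient descent step on (1/2n) * ||Y - W3 W2 W1 X||_F^2."""
    n = X.shape[1]
    residual = Y - W3 @ W2 @ W1 @ X
    g1 = (W3 @ W2).T @ residual @ X.T
    g2 = W3.T @ residual @ (W1 @ X).T
    g3 = residual @ (W2 @ W1 @ X).T
    return W1 + lr / n * g1, W2 + lr / n * g2, W3 + lr / n * g3


def offdiag_norm(M):
    """Frobenius norm of the off-diagonal part of M."""
    return np.linalg.norm(M - np.diag(np.diag(M)))


rng = np.random.default_rng(0)
k, n, lr, steps = 4, 6, 0.05, 100  # illustrative sizes, not from the paper

# Data built so that U^T Y X^T V and V^T X X^T V are diagonal.
U, V, U1, U2 = (random_orthogonal(k, rng) for _ in range(4))
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))  # n x k, orthonormal columns
dx, dy = rng.uniform(1.0, 2.0, k), rng.uniform(1.0, 2.0, k)
X = V @ np.diag(dx) @ Q.T
Y = U @ np.diag(dy) @ Q.T
lam_prime, lam = dy * dx, dx ** 2  # diagonals of Lambda' and Lambda

# Aligned initialization: U_1 = V_2, U_2 = V_3, with V_1 = V and U_3 = U.
s = [rng.uniform(0.5, 1.0, k) for _ in range(3)]
W_init = (U1 @ np.diag(s[0]) @ V.T,
          U2 @ np.diag(s[1]) @ U1.T,
          U @ np.diag(s[2]) @ U2.T)

W1, W2, W3 = W_init
for _ in range(steps):
    W1, W2, W3 = gd_step(W1, W2, W3, X, Y, lr)
    # Singular-value recursion, applied entrywise to each diagonal.
    resid = lam_prime - lam * (s[0] * s[1] * s[2])
    s = [s[0] + lr / n * s[1] * s[2] * resid,
         s[1] + lr / n * s[0] * s[2] * resid,
         s[2] + lr / n * s[0] * s[1] * resid]

# Express each trained layer in its initial singular bases.
cores = [U1.T @ W1 @ V, U2.T @ W2 @ U1, U.T @ W3 @ U2]
max_offdiag = max(offdiag_norm(C) for C in cores)
max_sv_error = max(np.abs(np.diag(C) - si).max() for C, si in zip(cores, s))

# Contrast: with generic targets, one step already breaks the diagonal form.
Y_generic = rng.standard_normal((k, n))
_, W2_generic, _ = gd_step(*W_init, X, Y_generic, lr)
generic_offdiag = offdiag_norm(U2.T @ W2_generic @ U1)
```

In this sketch, `max_offdiag` and `max_sv_error` stay at the level of floating-point round-off, while `generic_offdiag` is visibly nonzero, which matches the claim that the singular vectors are frozen only when the data conditions hold.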

C PROOF OF PROPOSITION 1

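Proof. By Corollary 1, under strong alignment, each singular value is updated independently of the others. Thus we can focus on how the $k$th singular value of each layer is updated. Recall that $\sigma_k(W_i^{(t)})$ denotes the $k$th diagonal entry of $\Sigma_i^{(t)}$. Since we are focusing on a fixed $k$, we drop the subscript $k$ for convenience and let $\sigma_i^{(t)}$ equal $\sigma_k(W_i^{(t)})$. The $\sigma_i^{(t)}$ are updated by the following rule:

$$\sigma_i^{(t+1)} = \sigma_i^{(t)} + \frac{\gamma}{n} \prod_{j \neq i} \sigma_j^{(t)} \left( \lambda'_k - \lambda_k \prod_{j=1}^{d} \sigma_j^{(t)} \right), \tag{13}$$

where $\lambda'_k, \lambda_k$ are the $k$th diagonal elements of $\Lambda', \Lambda$. We assume that $\Lambda'$ and $\Lambda$ have the same zero pattern. Therefore $\lambda'_k = 0$ if and only if $\lambda_k = 0$. If both of these values are zero, then $\sigma_i$ is not updated. Otherwise, assume $\lambda'_k, \lambda_k \neq 0$. Note that $\lambda_k > 0$, since $X X^T$ is positive semidefinite. We can also negate columns of $U$ to ensure that $\lambda'_k > 0$ as well. Let $\eta = \frac{\gamma \lambda_k}{n}$, and define $S^{(t)} = \prod_{j=1}^{d} \sigma_j^{(t)}$. This yields

$$\sigma_i^{(t+1)} = \sigma_i^{(t)} + \eta \frac{S^{(t)}}{\sigma_i^{(t)}} \left( \frac{\lambda'_k}{\lambda_k} - S^{(t)} \right).$$

Therefore (dropping the superscript to let $S = S^{(t)}$),

$$\begin{aligned}
S^{(t+1)} = \prod_{i=1}^{d} \sigma_i^{(t+1)} &= \prod_{i=1}^{d} \left( \sigma_i^{(t)} + \eta S \frac{1}{\sigma_i^{(t)}} \left( \frac{\lambda'_k}{\lambda_k} - S \right) \right) \\
&= S + \sum_{T \subset [d] : |T| \geq 1} \eta^{|T|} S^{|T|} \left( \frac{\lambda'_k}{\lambda_k} - S \right)^{|T|} \prod_{i \in T} \frac{1}{\sigma_i^{(t)}} \prod_{i \notin T} \sigma_i^{(t)} \\
&= S + \sum_{T \subset [d] : |T| \geq 1} \eta^{|T|} S^{|T|+1} \left( \frac{\lambda'_k}{\lambda_k} - S \right)^{|T|} \prod_{i \in T} \frac{1}{(\sigma_i^{(t)})^2},
\end{aligned}$$

and hence

$$\begin{aligned}
\frac{\lambda'_k}{\lambda_k} - S^{(t+1)} &= \frac{\lambda'_k}{\lambda_k} - S - \sum_{T \subset [d] : |T| \geq 1} \eta^{|T|} S^{|T|+1} \left( \frac{\lambda'_k}{\lambda_k} - S \right)^{|T|} \prod_{i \in T} \frac{1}{(\sigma_i^{(t)})^2} \qquad (14) \\
&= \left( \frac{\lambda'_k}{\lambda_k} - S \right) \left[ 1 - \sum_{T \subset [d] : |T| \geq 1} \eta^{|T|} S^{|T|+1} \left( \frac{\lambda'_k}{\lambda_k} - S \right)^{|T|-1} \prod_{i \in T} \frac{1}{(\sigma_i^{(t)})^2} \right].
\end{aligned}$$

Thus we obtain

$$\frac{\lambda'_k}{\lambda_k} - S^{(t+1)} = \left( \frac{\lambda'_k}{\lambda_k} - S^{(t)} \right) \cdot r_k^{(t)}, \tag{16}$$

where

$$r_k^{(t)} = 1 - \sum_{T \subset [d] : |T| \geq 1} \eta^{|T|} S^{|T|+1} \left( \frac{\lambda'_k}{\lambda_k} - S \right)^{|T|-1} \prod_{i \in T} \frac{1}{(\sigma_i^{(t)})^2}. \tag{17}$$

We aim to bound $r_k^{(t)}$ from both above and below. First, we show that $r_k^{(t)}$ is nonnegative in order to prove the following lemma:

Lemma 3. $0 < S^{(j)} \leq \frac{\lambda'_k}{\lambda_k}$ for all $j \geq 0$.

Proof. We proceed by induction. By the original assumptions in Proposition 1, $0 < S^{(0)} \leq \frac{\lambda'_k}{\lambda_k}$. Now assume that $0 < S^{(j)} \leq \frac{\lambda'_k}{\lambda_k}$ for all $j \leq t$.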
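By the update rule in Equation (13), $\sigma_i^{(j+1)} \geq \sigma_i^{(j)}$. Since $\sigma_i^{(0)} > 0$, we have $\sigma_i^{(j)} > 0$, so $S^{(j)} > 0$. We also have that

$$\prod_{i \in T} \frac{1}{(\sigma_i^{(t)})^2} \leq \prod_{i \in T} \frac{1}{(\sigma_i^{(0)})^2} \leq \frac{1}{(\min_i \sigma_i^{(0)})^{2|T|}}.$$

Next, note that we can bound $S^{|T|+1} \left( \frac{\lambda'_k}{\lambda_k} - S \right)^{|T|-1} \leq \left( \frac{\lambda'_k}{\lambda_k} \right)^{2|T|}$. This means that we can upper bound the sum in Equation (17) as

$$\begin{aligned}
\sum_{T \subset [d] : |T| \geq 1} \eta^{|T|} S^{|T|+1} \left( \frac{\lambda'_k}{\lambda_k} - S \right)^{|T|-1} \prod_{i \in T} \frac{1}{(\sigma_i^{(t)})^2} &\leq \sum_{T \subset [d] : |T| \geq 1} \eta^{|T|} (\min_i \sigma_i^{(0)})^{-2|T|} \left( \frac{\lambda'_k}{\lambda_k} \right)^{2|T|} \\
&= \left( 1 + \eta \cdot (\min_i \sigma_i^{(0)})^{-2} \left( \frac{\lambda'_k}{\lambda_k} \right)^{2} \right)^{d} - 1.
\end{aligned}$$

Since $\gamma \leq \frac{n \ln 2}{d} \cdot \frac{\min_i (\sigma_i^{(0)})^2 \, \lambda_k}{{\lambda'_k}^2}$, we have that $\eta \leq \frac{\ln 2 \cdot \min_i (\sigma_i^{(0)})^2}{d} \cdot \frac{\lambda_k^2}{{\lambda'_k}^2}$, and thus the right-hand side of the above expression can be upper bounded by

$$\left( 1 + \eta \cdot (\min_i \sigma_i^{(0)})^{-2} \left( \frac{\lambda'_k}{\lambda_k} \right)^{2} \right)^{d} - 1 \leq e^{d \eta (\min_i \sigma_i^{(0)})^{-2} (\lambda'_k / \lambda_k)^2} - 1 \leq e^{\ln 2} - 1 = 1.$$

Therefore $r_k^{(t)} \geq 0$. Plugging into Equation (16), since $S^{(t)} = S \leq \frac{\lambda'_k}{\lambda_k}$, we get that $S^{(t+1)} \leq \frac{\lambda'_k}{\lambda_k}$, which completes the inductive step.

Next, we would like to upper bound $r_k^{(t)}$ by a term independent of $t$ in order to obtain linear convergence. We can lower bound the sum in Equation (17) by the contribution of the sets of size 1, so

$$\sum_{T \subset [d] : |T| \geq 1} \eta^{|T|} S^{|T|+1} \left( \frac{\lambda'_k}{\lambda_k} - S \right)^{|T|-1} \prod_{i \in T} \frac{1}{(\sigma_i^{(t)})^2} \geq \sum_{i=1}^{d} \eta S^2 \frac{1}{(\sigma_i^{(t)})^2} \geq \eta S^2 \cdot d S^{-2/d},$$

where the last inequality is due to AM-GM. Lemma 3 implies that $S^{(j+1)} \geq S^{(j)}$, which means that the above sum is at least $\eta d (S^{(0)})^{2 - 2/d}$, so we can upper bound $r_k^{(t)}$ by

$$r_k^{(t)} \leq 1 - \eta d (S^{(0)})^{2 - 2/d}.$$

This implies that $S^{(t+1)}$ is closer to $\frac{\lambda'_k}{\lambda_k}$ than $S$ is, and in particular

$$\frac{\lambda'_k}{\lambda_k} - S^{(t+1)} \leq \left( \frac{\lambda'_k}{\lambda_k} - S \right) \left( 1 - d \eta (S^{(0)})^{2 - 2/d} \right); \quad \text{hence} \quad \frac{\lambda'_k}{\lambda_k} - S^{(t)} \leq \left( \frac{\lambda'_k}{\lambda_k} - S^{(0)} \right) \left( 1 - d \eta (S^{(0)})^{2 - 2/d} \right)^{t}.$$

Since the initialization is fixed, the quantity $1 - d \eta (S^{(0)})^{2 - 2/d}$ is fixed, and thus $S^{(t)}$ converges linearly to $\frac{\lambda'_k}{\lambda_k}$. Therefore each of the top $k$ singular values converges linearly to its optimal value $\frac{\lambda'_k}{\lambda_k}$, which means that the loss converges linearly as well.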
To complete the proof, it suffices to show that this limit solution achieves a training loss of zero. This is proven in a more general setting at the end of Appendix E.
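The linear rate can be illustrated with a short simulation of the scalar recursion for a single singular-value index. This is a hedged sketch rather than the authors' code: the function name `simulate_product`, the initial values, and the choices $\lambda'_k = 2$, $\lambda_k = 1$, $n = 10$ are assumptions made only for illustration. The learning rate is set to the largest value permitted by the bound used in the proof, and the gap $\lambda'_k / \lambda_k - S^{(t)}$ is compared against the geometric envelope derived above.

```python
import numpy as np


def simulate_product(sigma0, lam_prime, lam, n, steps):
    """Iterate the per-index singular value update and track the gap to lam'/lam."""
    sigma = np.asarray(sigma0, dtype=float).copy()
    d = sigma.size
    target = lam_prime / lam
    # Largest learning rate allowed by the bound used in the proof.
    gamma = n * np.log(2) / d * sigma.min() ** 2 * lam / lam_prime ** 2
    eta = gamma * lam / n
    S0 = sigma.prod()
    # Contraction factor 1 - d * eta * S0^(2 - 2/d) from the proof.
    rate = 1.0 - d * eta * S0 ** (2.0 - 2.0 / d)
    gaps, bounds = [], []
    for t in range(steps + 1):
        S = sigma.prod()
        gaps.append(target - S)
        bounds.append((target - S0) * rate ** t)
        # prod_{j != i} sigma_j equals S / sigma_i.
        sigma = sigma + gamma / n * (S / sigma) * (lam_prime - lam * S)
    return np.array(gaps), np.array(bounds), rate


# Hypothetical example values: three layers, S^(0) = 0.21 <= lam'/lam = 2.
gaps, bounds, rate = simulate_product(
    [0.5, 0.6, 0.7], lam_prime=2.0, lam=1.0, n=10, steps=2000
)
```

Here `gaps[t]` is the distance of the product of singular values from its target after `t` steps and `bounds[t]` is the envelope $(\lambda'_k/\lambda_k - S^{(0)})(1 - d\eta (S^{(0)})^{2-2/d})^t$; under the stated assumptions the gap should remain nonnegative, shrink monotonically, and stay below the envelope.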

D PROOF OF THEOREM 4

We can finally state the generalization of Theorem 1 to the non-square setting:

Theorem 4. Let $f : \mathbb{R}^{k_0} \to \mathbb{R}^{k_d}$ be a linear fully connected network as in Equation 1, and let $r = \min(k_0, \ldots, k_d)$. Strong alignment is an invariant of training under the squared loss on the dataset $(X, Y)$ if and only if there exist orthonormal matrices $U \in \mathbb{R}^{k_d \times k_d}$, $V \in \mathbb{R}^{k_0 \times k_0}$ such that

$$U^T Y X^T V = \begin{bmatrix} \Lambda' & 0 \\ 0 & A_1 \end{bmatrix}, \quad \text{and} \quad V^T X X^T V = \begin{bmatrix} \Lambda & 0 \\ 0 & A_2 \end{bmatrix}$$

for diagonal $r \times r$ matrices $\Lambda, \Lambda'$ and arbitrary $A_1 \in \mathbb{R}^{(k_d - r) \times (k_0 - r)}$, $A_2 \in \mathbb{R}^{(k_0 - r) \times (k_0 - r)}$.

Proof. By Lemma 2 we know that under strong alignment there exist $U$ and $V$ satisfying the above conditions. In the other direction, Lemma 2 also tells us that given $U$ and $V$ satisfying the data conditions, all the conditions of strong alignment hold except for convergence to a global minimum. To conclude, we must show that regardless of the zero pattern of $\Lambda$ or $\Lambda'$, under a strongly aligned initialization the network converges to a solution with a loss of zero. Using the convenient notation $\sigma_i^{(t)} = \sigma_k(W_i^{(t)})$, we again focus on how the $k$th singular values of each layer are updated, for some $k \in [r]$. Recall that the $\sigma_i^{(t)}$ are updated as

$$\sigma_i^{(t+1)} = \sigma_i^{(t)} + \frac{\gamma}{n} \prod_{j \neq i} \sigma_j^{(t)} \left( \lambda'_k - \lambda_k \prod_{j=1}^{d} \sigma_j^{(t)} \right).$$

The rank of $X$ must be at least the rank of $Y$ in order for the data to be linearly interpolated. Therefore we can choose $U, V$ (via permuting columns) to ensure that whenever $\lambda_k = 0$, $\lambda'_k = 0$ as well. This ensures that $\sigma_k(W_i^{(t)})$ is never updated in this case. If $\lambda_k, \lambda'_k \neq 0$, then we showed in Proposition 1 that $S^{(t)}$ converges to $\lambda'_k / \lambda_k$ in the limit. Finally, we consider the case where $\lambda'_k = 0$, $\lambda_k \neq 0$. Assume that $\sigma_i^{(t)} < 1$ and $\gamma < \frac{n}{\lambda_k}$. Then, the $\sigma_i$ update as

$$\sigma_i^{(t+1)} = \sigma_i^{(t)} + \frac{\gamma}{n} \prod_{j \neq i} \sigma_j^{(t)} \left( -\lambda_k \prod_{j=1}^{d} \sigma_j^{(t)} \right) = \sigma_i^{(t)} \left( 1 - \eta \prod_{j \neq i} (\sigma_j^{(t)})^2 \right),$$

where $\eta = \frac{\gamma \lambda_k}{n}$. We observe that $0 \leq \sigma_i^{(t+1)} \leq \sigma_i^{(t)}$.
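Therefore

$$0 \leq S^{(t+1)} = S^{(t)} \prod_{i=1}^{d} \left( 1 - \eta \prod_{j \neq i} (\sigma_j^{(t)})^2 \right) \leq S^{(t)} \exp\left( -\eta \sum_{i=1}^{d} \prod_{j \neq i} (\sigma_j^{(t)})^2 \right) \leq S^{(t)} \exp\left( -\eta d \, (S^{(t)})^{2 - 2/d} \right),$$

where the last inequality follows from AM-GM. Since $S^{(0)}$ is positive, we see that $0 \leq S^{(t+1)} \leq S^{(t)}$, and therefore $S^{(t)}$ must converge to some constant $c$. Assume that $c \neq 0$. For all $\epsilon > 0$, there exists some $T$ such that $S^{(T)} < c + \epsilon$. Then,

$$S^{(T+1)} \leq S^{(T)} \exp\left( -\eta d \, (S^{(T)})^{2 - 2/d} \right) < (c + \epsilon) \exp\left( -\eta c^{2 - 2/d} \right),$$

where $\exp\left( -\eta c^{2 - 2/d} \right)$ is a constant which is less than 1. Hence if we choose $\epsilon$ such that $\exp\left( -\eta c^{2 - 2/d} \right) < \frac{c}{c + \epsilon}$, then $S^{(T+1)} < c$, a contradiction. Therefore $c = 0$, and hence $S^{(t)} \to 0 = \lambda'_k / \lambda_k$.

In general, we have shown that if $\lambda_k \neq 0$, then $\sigma_k(W_1^{(t)}) \cdots \sigma_k(W_d^{(t)}) \to \lambda'_k / \lambda_k$. This solution is given by $f(x) = U_d \Lambda' \Lambda^{-1} V_1^T x$, which is the solution given by the pseudoinverse and therefore has a loss of zero.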
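The decaying case $\lambda'_k = 0$, $\lambda_k \neq 0$ can be sketched in a few lines. The code below is an illustration under assumed values (three layers, initial singular values below 1, $\eta = 0.5$, and the helper name `decay_step`), not the authors' implementation; it iterates the multiplicative update and records both the product $S^{(t)}$ and the per-step bound $S^{(t)} \exp(-\eta d (S^{(t)})^{2 - 2/d})$.

```python
import numpy as np


def decay_step(sigma, eta):
    """Apply sigma_i <- sigma_i * (1 - eta * prod_{j != i} sigma_j^2)."""
    prod_sq = np.prod(sigma ** 2)
    others_sq = prod_sq / sigma ** 2  # prod over j != i of sigma_j^2
    return sigma * (1.0 - eta * others_sq)


# Hypothetical values satisfying sigma_i < 1 and eta < 1.
sigma = np.array([0.9, 0.8, 0.7])
eta, d = 0.5, 3

products, step_bounds = [], []
for _ in range(500):
    S = sigma.prod()
    products.append(S)
    # Upper bound on the next product from the AM-GM argument.
    step_bounds.append(S * np.exp(-eta * d * S ** (2.0 - 2.0 / d)))
    sigma = decay_step(sigma, eta)
products.append(sigma.prod())

products = np.array(products)
step_bounds = np.array(step_bounds)
```

Each entry `products[t + 1]` can be compared with `step_bounds[t]`; under these assumptions the product decreases monotonically toward zero while respecting the bound at every step.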

E COMPLETING THE PROOF OF THEOREM 1

Proof. In Lemma 1, we showed that in the setting where all layers are square, alignment is equivalent to strong alignment. Theorem 4 states that in general, strong alignment is an invariant if and only if there exist $U, V$ satisfying particular data conditions. Since in the square setting $r = k$, by Theorem 4 we have that strong alignment is an invariant if and only if there exist $U, V$ such that $U^T Y X^T V$ and $V^T X X^T V$ are diagonal, as desired.

F ALIGNMENT FOR 1-DIMENSIONAL OUTPUTS

Proposition 5. Assuming gradient descent avoids the point where all parameters are zero, alignment is an invariant of training for any linear fully connected network $f : \mathbb{R}^{k_0} \to \mathbb{R}$, any convex, twice continuously differentiable loss function, and data $(X, Y) \in \mathbb{R}^{k_0 \times n} \times \mathbb{R}^{1 \times n}$ for which the network can achieve zero training error.

Proof. If we initialize the weight matrices to be rank 1 and aligned, then the matrices $\{\Sigma_i^{(t)}\}_{i=1}^{d}$ are diagonal with a single nonzero entry. Following the proof of Theorem 1, we obtain that alignment is an invariant if the matrix

$$\prod_{j=i+1}^{d} {\Sigma_j^{(t)}}^T \, U_d^T \sum_{k=1}^{n} \frac{\partial \ell}{\partial f}\left( x^{(k)}, y^{(k)} \right) {x^{(k)}}^T V_1^{(t)} \prod_{j=1}^{i-1} {\Sigma_j^{(t)}}^T$$

is diagonal. When $i \neq 1, d$, this matrix is clearly of rank 1 and diagonal (and has a single nonzero entry). This implies that $U_i, V_i$ are invariant for all $i \neq 1, d$. If $i = d$, then since $k_d = 1$, the above quantity is also a rank 1 diagonal matrix, implying that $U_d$ and $V_d$ are invariant. Finally, if $i = 1$, the above matrix is rank 1 but not necessarily diagonal. However, all but the top row are zeros, which after plugging into the gradient descent update rule implies that $U_1$ is invariant as well. Importantly, layers $W_{i+1}, W_i$ for $i \in [d-1]$ remain aligned regardless of the loss function used, as the expression above is always a diagonal matrix with a single nonzero entry when the layers are initialized to be rank 1. The final step is to show that training leads to zero error according to Definition 3.
To do this, we first characterize the stationary points and then, under additional assumptions, prove that the loss converges to zero. Let $v_1^{(t)}$ denote the first column of $V_1^{(t)}$, and let $\sigma_1(W_j^{(t)})$ denote the top singular value in the usSVD of $W_j^{(t)}$. Then the stationary points are given by:

1. $\sigma_1(W_j^{(t)}) = 0$ for $j \in [d]$.
2. $v_1^{(t)} \perp \sum_{k=1}^{n} \frac{\partial \ell}{\partial f}\left( x^{(k)}, y^{(k)} \right) {x^{(k)}}^T$.

If we initialize $\sigma_1(W_1) = 0$, then we have that

$$\sigma_1(W_1^{(t)}) \, {v_1^{(t)}}^T = \sum_{k=1}^{n} c_k^{(t)} {x^{(k)}}^T, \qquad \sum_{k=1}^{n} c_k^{(t+1)} {x^{(k)}}^T = \sum_{k=1}^{n} \left( c_k^{(t)} + \gamma \prod_{j \neq 1} \sigma_1(W_j^{(t)}) \frac{\partial \ell}{\partial f}\left( x^{(k)}, y^{(k)} \right) \right) {x^{(k)}}^T$$

for $c_k^{(t)} \in \mathbb{R}$ and for all $t \in \mathbb{Z}_{\geq 0}$. Hence, updates to $v_1^{(t)}$ are in the span of the data, and so, assuming that $\{x^{(k)}\}_{k=1}^{n}$ are linearly independent, $v_1$ cannot be orthogonal to $\sum_{k=1}^{n} \frac{\partial \ell}{\partial f}\left( x^{(k)}, y^{(k)} \right) {x^{(k)}}^T$ unless the $c_k^{(t)}$ are all 0, i.e., $\sigma_1(W_1^{(t)}) = 0$ for $t > 0$.

Next, if we initialize $\sigma_1(W_i^{(0)}) = \sigma_1(W_j^{(0)})$, then $\sigma_1(W_i^{(t)}) = \sigma_1(W_j^{(t)})$ for all $i, j \in \{2, \ldots, d\}$ and $t \geq 0$, since for all $i \in \{2, \ldots, d\}$:

$$\sigma_1(W_i^{(t+1)}) = \sigma_1(W_i^{(t)}) + \prod_{j \neq i} \sigma_1(W_j^{(t)}) \sum_{k=1}^{n} \frac{\partial \ell}{\partial f}\left( x^{(k)}, y^{(k)} \right) {x^{(k)}}^T v_1^{(t)}.$$

This initialization corresponds to layers $W_{i+1}, W_i$ being balanced for $i \in \{2, \ldots, d\}$. Thus, under this initialization, the only other stationary point is given by $\sigma_1(W_i) = 0$ for all $i \in \{2, \ldots, d\}$. Hence, if gradient descent avoids the non-strict saddle points given by $\sigma_1(W_i) = 0$ for all $i \in \{2, \ldots, d\}$ and $\sigma_1(W_i^{(t)}) = 0$ for all $i \in [d]$, then gradient descent converges to a local (and thus global) minimum of the convex loss. The former stationary point can be avoided by re-parameterizing the network such that $\sigma_1(W_i^{(t)}) = \sigma_1$ for all $i \in \{2, \ldots, d\}$ (i.e., $\sigma_1 = 0$ now corresponds to a strict saddle as defined in Lee et al. (2016)), and then taking a random initialization for $\sigma_1$.
This would correspond to gradient descent on the original parameterization with a scaling factor on the learning rate for the parameters $\sigma_1(W_i^{(t)})$ for $i \in \{2, \ldots, d\}$. The latter stationary point is avoided by the assumption in the proposition.

G PROOF OF PROPOSITION 2

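Proof. For any matrices $A, B \in \mathbb{C}^{m \times n}$, we have that $2 \sigma_i(A B^*) \leq \sigma_i(A^* A + B^* B)$ (Bhatia, 1997). Thus, letting $A = W_2$ and $B = W_1^T$, we see that

$$\begin{aligned}
2 \sigma_i(W_2 W_1) \leq \sigma_i(W_2^T W_2 + W_1 W_1^T) \implies 2 \sum_i \sigma_i(P) &\leq \sum_i \sigma_i(W_2^T W_2 + W_1 W_1^T) \\
&= \left\| W_2^T W_2 + W_1 W_1^T \right\|_1 \\
&\leq \left\| W_2^T W_2 \right\|_1 + \left\| W_1 W_1^T \right\|_1 = \| W_2 \|_F^2 + \| W_1 \|_F^2 ,
\end{aligned}$$

where $\| \cdot \|_1$ denotes the sum of singular values. This lower bound is in fact achieved for an aligned solution. If the SVD of $P$ is $P = U \Sigma V^T$, setting $W_1 = W \Sigma^{1/2} V^T$ and $W_2 = U \Sigma^{1/2} W^T$ for an orthonormal matrix $W$ gives $W_2 W_1 = P$ and yields $\| W_1 \|_F^2 = \| W_2 \|_F^2 = \mathrm{Tr}(\Sigma)$, so $\| W_1 \|_F^2 + \| W_2 \|_F^2 = 2 \mathrm{Tr}(\Sigma)$.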
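Both the inequality and its tightness are easy to check numerically. The sketch below is illustrative only: it assumes square $5 \times 5$ real factors, and it uses an arbitrary orthogonal matrix `Q` (a hypothetical choice) as the shared singular basis between the two aligned layers.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5  # illustrative layer width

# Generic (unaligned) factors of P = W2 @ W1.
W1 = rng.standard_normal((k, k))
W2 = rng.standard_normal((k, k))
P = W2 @ W1

nuclear_norm = np.linalg.norm(P, "nuc")  # sum of singular values of P
frob_sum = np.linalg.norm(W1, "fro") ** 2 + np.linalg.norm(W2, "fro") ** 2

# Aligned, balanced factorization of the same P: the singular values are
# split evenly and the inner singular bases coincide (both equal Q).
U, s, Vt = np.linalg.svd(P)
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
W2_aligned = U @ np.diag(np.sqrt(s)) @ Q.T
W1_aligned = Q @ np.diag(np.sqrt(s)) @ Vt

frob_sum_aligned = (
    np.linalg.norm(W1_aligned, "fro") ** 2
    + np.linalg.norm(W2_aligned, "fro") ** 2
)
```

For the generic factors, `frob_sum` is at least `2 * nuclear_norm`; for the aligned factors, `frob_sum_aligned` matches `2 * nuclear_norm` up to round-off while `W2_aligned @ W1_aligned` still reproduces `P`.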

H PROOF OF THEOREM 2

Proof. Given an arbitrary loss function $\ell$, assume that the $i$th layer is restricted to some structure given by a subspace $S$ with basis matrices $A_1, \ldots, A_m$, so that at timestep $t$ we have that

$$W_i^{(t)} = \sum_{j=1}^{m} (c_j^i)^{(t)} A_j .$$

We take the gradient of the loss with respect to the $c_j^i$. The chain rule yields:

$$\frac{\partial \ell}{\partial c_j^i} = \sum_{p,q=1}^{n} \frac{\partial \ell}{\partial (W_i)_{pq}} \cdot \frac{\partial (W_i)_{pq}}{\partial c_j^i} = \sum_{p,q=1}^{n} \frac{\partial \ell}{\partial (W_i)_{pq}} \cdot (A_j)_{pq} .$$

The gradient descent update on $c_j^i$ is thus:

$$(c_j^i)^{(t+1)} = (c_j^i)^{(t)} - \eta \cdot \sum_{p,q=1}^{n} \frac{\partial \ell}{\partial (W_i)_{pq}} (A_j)_{pq} .$$

We calculate the projection operator $\pi$ of some arbitrary matrix $M$ onto $S$. For a basis that is orthogonal with respect to the inner product $\langle A, B \rangle = \mathrm{Tr}(A^T B)$, we can write $\pi(M) = \sum_{j=1}^{m} \frac{\langle M, A_j \rangle}{\langle A_j, A_j \rangle} A_j$. If we define $\pi_S(M) = \sum_{j=1}^{m} \left( \sum_{p,q} M_{pq} (A_j)_{pq} \right) A_j$, then gradient descent on the $c_j^i$ gives the following update rule on the $W_i$:

$$W_i^{(t+1)} = W_i^{(t)} - \eta \cdot \pi_S\left( \frac{\partial \ell}{\partial W_i} \right).$$

If the $A_j$ all have norm 1, then $\pi = \pi_S$, and this is the same update rule given by projected gradient descent with respect to the subspace $S$. Otherwise, $\pi_S$ is simply the projection $\pi$ followed by appropriate scaling in each of the basis directions.

I TREATING A CONVOLUTIONAL LAYER AS A LINEAR SUBSPACE

Consider a $3 \times 3$ image. We map it to a 9-dimensional vector as follows:

$$\begin{bmatrix} x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 \\ x_7 & x_8 & x_9 \end{bmatrix} \implies \begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 & x_8 & x_9 \end{bmatrix}^T .$$

Then, the linear transformation given by applying a $3 \times 3$ convolutional filter with entries $c_1, \ldots, c_9$ is represented by the matrix

$$W = \begin{bmatrix}
c_5 & c_4 & 0 & c_2 & c_1 & 0 & 0 & 0 & 0 \\
c_6 & c_5 & c_4 & c_3 & c_2 & c_1 & 0 & 0 & 0 \\
0 & c_6 & c_5 & 0 & c_3 & c_2 & 0 & 0 & 0 \\
c_8 & c_7 & 0 & c_5 & c_4 & 0 & c_2 & c_1 & 0 \\
c_9 & c_8 & c_7 & c_6 & c_5 & c_4 & c_3 & c_2 & c_1 \\
0 & c_9 & c_8 & 0 & c_6 & c_5 & 0 & c_3 & c_2 \\
0 & 0 & 0 & c_8 & c_7 & 0 & c_5 & c_4 & 0 \\
0 & 0 & 0 & c_9 & c_8 & c_7 & c_6 & c_5 & c_4 \\
0 & 0 & 0 & 0 & c_9 & c_8 & 0 & c_6 & c_5
\end{bmatrix}.$$

Then $S$ consists of all matrices of the form $W$. $S$ is a 9-dimensional subspace of $\mathbb{R}^{9 \times 9}$, with an orthonormal basis with coefficients being the $c_i$.

J PROOF OF PROPOSITION 3

Proof. For $i \in [d]$, let $U_i \Sigma_i V_i^T$ be a usSVD of $W_i$ witnessing alignment of $f$. We can then rewrite $Y = f(X)$ as $Y = U_d \left( \prod_{i=1}^{d} \Sigma_i \right) V_1^T X$, thus proving the desired statement.

K PROOF OF PROPOSITION 4

Before we can prove Proposition 4, we require the following definition from combinatorics.

Definition 5. A partition of an integer $k$ is a tuple $\lambda = (\lambda_1, \ldots, \lambda_s)$ such that $\lambda_i \geq \lambda_{i+1}$ for all $i$ and $k = \lambda_1 + \cdots + \lambda_s$. Each $\lambda_i$ is called a part of $\lambda$. We let $s(\lambda)$ denote the number of parts of $\lambda$ and we write $\lambda \vdash k$ to indicate that $\lambda$ is a partition of $k$.

Proof of Proposition 4. Given a $k \times k$ matrix $A$, let $\lambda(A)$ denote the partition $\lambda$ of $k$ such that $\lambda_i$ is the multiplicity of the $i$th greatest singular value of $A$. Let $U(A)$ denote the set of matrices $U$ such that $U \Sigma V^T$ is a usSVD of $A$. The dimension of $U(A)$ is $\sum_{i=1}^{s(\lambda(A))} \binom{\lambda_i}{2}$. To see this, note that any orthonormal basis of the eigenspace of $A A^T$ corresponding to the multiplicity-$\lambda_i$ eigenvalue of $A A^T$ can be the corresponding columns in an element of $U(A)$, and that the set of orthonormal bases of an $m$-dimensional linear space has dimension $\binom{m}{2}$. For any set $Q$ of matrices, define $U(Q)$ to be the set of all possible sets of left-singular vectors of elements of $Q$. That is, $U(Q) = \bigcup_{A \in Q} U(A)$.

For each partition $\lambda$ of $k$, let $T_\lambda$ denote the set of matrices $A$ such that $\lambda(A) = \lambda$. The dimension of $T_\lambda \cap S$ is at most $r$ and therefore the dimension of $U(S \cap T_\lambda)$ is at most $r + \sum_{i=1}^{s(\lambda)} \binom{\lambda_i}{2}$. Let $O(k, n)$ denote the set of $k \times n$ matrices with orthonormal columns. Assume alignment is possible over $S$ for a non-measure-zero set of matrices with $n$ columns. Then there exists $B \subseteq O(k, n)$ with $\dim(B) = \dim(O(k, n))$ such that for every $U \in B$, $U(S)$ contains a matrix whose first $n$ columns are $U$. Therefore $\dim(U(S)) \geq \dim(O(k, n))$. Since $\dim(O(k, n)) = \binom{k}{2} - \binom{k-n}{2}$, the following must be satisfied for some $\lambda \vdash k$:

$$r + \sum_{i=1}^{s(\lambda)} \binom{\lambda_i}{2} \geq \binom{k}{2} - \binom{k-n}{2}. \tag{18}$$

The maximum of the left-hand side is attained when $\lambda = (k)$, but in this case $T_\lambda$ is simply the set of scalar multiples of the identity. If we forbid $\lambda = (k)$, then we claim that the maximum value of $r + \sum_{i=1}^{s(\lambda)} \binom{\lambda_i}{2}$ is attained by $\lambda = (k-1, 1)$. To see this, note that for all $p < q$,

$$\binom{q-p}{2} + \binom{p}{2} = \binom{q}{2} - p(q - p) < \binom{q}{2}.$$

For $p > 0$, this is maximized when $p = 1$. This implies that the maximum value of $\sum_{i=1}^{s(\lambda)} \binom{\lambda_i}{2}$ will be obtained in as few summands as possible (which in our case is two), and in particular when $\lambda_1 = k - 1$ and $\lambda_2 = 1$. In this case, Equation 18 becomes

$$r + \binom{k-1}{2} \geq \binom{k}{2} - \binom{k-n}{2}.$$

Taking the logical negation of the above inequality and simplifying gives $r < k - 1 - \binom{k-n}{2}$.

The interpolation condition in this definition (i.e., achieving zero training error) is important in ruling out several architectures where the layers are trivially aligned. For example, if all layers are constrained to be diagonal matrices throughout training, then the layers are all trivially aligned, but cannot interpolate datasets where the target is not the product of a diagonal matrix with the input.

Orthogonality is w.r.t. the inner product $\langle A, B \rangle = \mathrm{Tr}(A^T B)$, or equivalently the dot product in $\mathbb{R}^{mn}$.

$\pi_S$ is a projection in the traditional sense if and only if the $A_j$ form an orthonormal basis; otherwise, $\pi_S$ is a projection onto $S$ followed by an appropriate scaling in each basis direction.

This is not a serious restriction; modulo scalar multiplication, the only case in which such a network could achieve zero loss is autoencoding, in which case the latent space would be a scalar multiple of the data itself.

Hyperparameter settings are detailed in Appendix M.

(a) Multi-dimensional regression on random data with squared loss. (b) Multi-class classification on MNIST with squared loss. (c) Multi-class classification on MNIST with cross entropy loss.

Figure 1: Examples of fully connected networks with multi-dimensional outputs where alignment is not an invariant of training.

Figure 2: Examples of layer constrained networks, where alignment is not an invariant of training.

L ADDITIONAL EXPERIMENTS

We provide empirical evidence that, when the conditions of Theorem 1 are satisfied, alignment is indeed invariant during training. We use a 2-hidden layer fully connected network with 9 hidden units per layer.

M EXPERIMENTAL SETUP

We provide network architectures and hyperparameters used for our experiments below. We trained our networks on an NVIDIA TITAN RTX GPU using the PyTorch library. In all settings, we train using gradient descent with a learning rate of $10^{-2}$ until the loss is below $10^{-4}$.

1. Figure 1a: We use a 2-hidden layer fully connected network with 9 hidden units per layer. Our data is given by matrices $X, Y \in \mathbb{R}^{9 \times 9}$ where each matrix entry is drawn from a standard normal distribution.
2. Figure 1b: We use a 2-hidden layer fully connected network with 1024 hidden units in the first hidden layer and 64 hidden units in the second hidden layer. Our data consists of 256 linearly separable examples from MNIST and is trained using squared loss.
3. Figure 1c: We use a 2-hidden layer fully connected network with 1024 hidden units in the first hidden layer and 64 hidden units in the second hidden layer. Our data consists of 256 linearly separable examples from MNIST and is trained using cross entropy loss.
4. Figure 2a: We use a 2-hidden layer network with 4 hidden units per layer, where each layer is constrained to be a Toeplitz matrix. Our input $X$ is equal to the identity, and our output $Y$ is a $4 \times 4$ matrix with each entry sampled from a standard normal distribution.
5. Figure 2b: We use a 2-hidden layer convolutional network with a single $3 \times 3$ filter in each layer, stride of 1, and padding of 1. Our data consists of a single example from MNIST.

Code for the experiments can be found at the following anonymized GitHub link: https://anonymous.4open.science/r/33277cc0-6074-46c4-8642-7feadd678278/.
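To make the Toeplitz-constrained setting of Figure 2a concrete, the following NumPy sketch illustrates the equivalence between gradient descent on the basis coefficients and the projected-type update of Theorem 2. It is not the released experiment code: it simplifies to a single $4 \times 4$ Toeplitz layer with identity input, and the names `toeplitz_from_coeffs`, `pi_S`, and `pi_orth`, the seed, and the step size are assumptions made for illustration. The coefficient gradient is also checked against central finite differences.

```python
import numpy as np

k = 4
# Basis of the Toeplitz subspace S: one indicator matrix per diagonal.
basis = [np.eye(k, k=offset) for offset in range(-(k - 1), k)]


def toeplitz_from_coeffs(c):
    """Build W = sum_j c_j A_j."""
    return sum(cj * Aj for cj, Aj in zip(c, basis))


def loss(W, X, Y):
    """Squared loss (1/2n) * ||Y - W X||_F^2 for a single linear layer."""
    return 0.5 / X.shape[1] * np.sum((Y - W @ X) ** 2)


def grad_W(W, X, Y):
    """Gradient of the squared loss with respect to the full matrix W."""
    return -(Y - W @ X) @ X.T / X.shape[1]


def pi_S(M):
    """pi_S(M) = sum_j <M, A_j> A_j (no normalization by ||A_j||^2)."""
    return sum(np.sum(M * Aj) * Aj for Aj in basis)


def pi_orth(M):
    """Orthogonal projection onto S; valid because the A_j are orthogonal."""
    return sum(np.sum(M * Aj) / np.sum(Aj * Aj) * Aj for Aj in basis)


rng = np.random.default_rng(0)
X = np.eye(k)                     # identity input, as in the Figure 2a setup
Y = rng.standard_normal((k, k))   # random targets
c = rng.standard_normal(len(basis))
lr = 1e-2

W = toeplitz_from_coeffs(c)
G = grad_W(W, X, Y)

# Chain rule: d loss / d c_j = <G, A_j>.
grad_c = np.array([np.sum(G * Aj) for Aj in basis])
W_from_coeff_step = toeplitz_from_coeffs(c - lr * grad_c)
W_from_projected_step = W - lr * pi_S(G)

# Central finite differences as an independent check of grad_c.
eps = 1e-6
grad_c_fd = np.array([
    (loss(toeplitz_from_coeffs(c + eps * e), X, Y)
     - loss(toeplitz_from_coeffs(c - eps * e), X, Y)) / (2 * eps)
    for e in np.eye(len(basis))
])
```

The two updated matrices `W_from_coeff_step` and `W_from_projected_step` coincide. Because the diagonal indicator matrices have different norms, `pi_S` differs from the orthogonal projection `pi_orth` by a per-direction scaling, which is why `pi_orth` is idempotent here while `pi_S` is not.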

