REVEALING THE STRUCTURE OF DEEP NEURAL NETWORKS VIA CONVEX DUALITY

Abstract

We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of the hidden layers. We show that a set of optimal hidden layer weights for a norm regularized DNN training problem can be explicitly found as the extreme points of a convex set. For the special case of deep linear networks with K outputs, we prove that each optimal weight matrix is rank-K and aligns with the previous layers via duality. More importantly, we apply the same characterization to deep ReLU networks with whitened data and prove that the same weight alignment holds. As a corollary, we prove that norm regularized deep ReLU networks yield spline interpolation for one-dimensional datasets, which was previously known only for two-layer networks. Furthermore, we provide closed-form solutions for the optimal layer weights when data is rank-one or whitened. We then verify our theory via numerical experiments.

1. INTRODUCTION

Deep neural networks (DNNs) have become extremely popular due to their success in machine learning applications. Even though DNNs are highly over-parameterized and non-convex, simple first-order algorithms, e.g., Stochastic Gradient Descent (SGD), can be used to train them successfully. Moreover, recent work has shown that highly over-parameterized networks trained with SGD obtain simple solutions that generalize well (Savarese et al., 2019; Parhi & Nowak, 2019; Ergen & Pilanci, 2020a;b): two-layer ReLU networks that achieve zero training error with minimum Euclidean norm weights are proven to fit a linear spline model in 1D regression. Therefore, regularizing the solution towards smaller norm weights might be the key to understanding the generalization properties of DNNs. However, analyzing DNNs remains theoretically elusive even in the absence of nonlinear activations. We therefore study norm regularized DNNs and develop a framework based on convex duality such that a set of optimal solutions to the training problem can be analytically characterized.

Deep linear networks have been the subject of extensive theoretical analysis due to their tractability. One line of research (Saxe et al., 2013; Arora et al., 2018a; Laurent & Brecht, 2018; Du & Hu, 2019; Shamir, 2018) focused on the dynamics of gradient descent (GD) training; however, these works do not analyze the generalization properties of deep networks. Another line of research (Gunasekar et al., 2017; Arora et al., 2019; Bhojanapalli et al., 2016) studied generalization via matrix factorization and showed that linear networks trained with GD converge to minimum nuclear norm solutions. Later on, Arora et al. (2018b); Du et al. (2018) showed that gradient flow forces the layer weights to align. Ji & Telgarsky (2019) further proved that each layer weight matrix is asymptotically rank-one.
These results provide insights into the structure of the optimal layer weights; however, they require multiple strong assumptions, e.g., linearly separable training data and a strictly decreasing loss function, which make the results impractical. Furthermore, Zhang et al. (2019) provided some characterizations for nonstandard networks, which are valid only for hinge loss and specific regularizations that include the data matrix. Unlike these studies, we introduce a complete characterization of the regularized deep network training problem without requiring such assumptions.

Our contributions: 1) We introduce a convex analytic framework that characterizes a set of optimal solutions to regularized training problems as the extreme points of a convex set, which is valid for vector outputs and popular loss functions including squared, cross entropy, and hinge loss (extensions to other loss functions are presented in Appendix A.1); 2) For deep linear networks with $K$ outputs, we prove via convex duality that each optimal layer weight matrix aligns with the previous layers and becomes rank-$K$; 3) For deep ReLU networks, we obtain the same weight alignment result for whitened or rank-one data matrices. As a corollary, we achieve closed-form solutions for the optimal hidden layer weights when data is whitened or rank-one (see Theorems 4.1 and 4.3). As another corollary, we prove that the optimal networks are linear spline interpolators for one-dimensional, i.e., rank-one, data, which generalizes the two-layer results for one-dimensional data in Savarese et al. (2019); Parhi & Nowak (2019); Ergen & Pilanci (2020a;b) to arbitrary depth. Additionally, we provide a comparison with previous studies about this characterization.

Figure 1: One dimensional interpolation using L-layer ReLU networks with 20 neurons in each hidden layer. As predicted by Corollary 4.2, the optimal solution is given by piecewise linear splines for any L ≥ 2.
We note that the analysis of ReLU networks for the one-dimensional data considered in these works is non-trivial, and it is a special case of our rank-one/whitened data assumption.

Notation: We denote matrices/vectors as uppercase/lowercase bold letters. We use $\mathbf{0}_k$ (or $\mathbf{1}_k$) and $\mathbf{I}_k$ to denote a vector of zeros (or ones) and the identity matrix of size $k$, respectively. We denote the set of integers from 1 to $n$ as $[n]$. To denote Frobenius, operator, and nuclear norms, we use $\|\cdot\|_F$, $\|\cdot\|_2$, and $\|\cdot\|_*$, respectively. Furthermore, $\sigma_{\max}(\cdot)$ and $\sigma_{\min}(\cdot)$ represent the maximum and minimum singular values, respectively, and the unit $\ell_2$ ball is defined as $\mathcal{B}_2 := \{u \in \mathbb{R}^d \mid \|u\|_2 \le 1\}$.

1.1. OVERVIEW OF OUR RESULTS

We consider an $L$-layer network with layer weights $W_l \in \mathbb{R}^{m_{l-1} \times m_l}$, $\forall l \in [L]$, where $m_0 = d$ and $m_L = 1$. Then, given a data matrix $X \in \mathbb{R}^{n \times d}$, the output is
$$f_{\theta,L}(X) = A_{L-1} w_L, \qquad A_l = g(A_{l-1} W_l) \ \forall l \in [L-1],$$
where $A_0 = X$ and $g(\cdot)$ is the activation function. Given a label vector $y \in \mathbb{R}^n$, the training problem can be formulated as
$$\min_{\{\theta_l\}_{l=1}^L} \mathcal{L}(f_{\theta,L}(X), y) + \beta \mathcal{R}(\theta), \qquad (1)$$
where $\mathcal{L}(\cdot,\cdot)$ is an arbitrary loss function, $\mathcal{R}(\theta)$ is a regularizer for the layer weights, $\beta > 0$ is a regularization parameter, $\theta_l = \{W_l, m_l\}$, and $\theta := \{\theta_l\}_{l=1}^L$. For simplicity of presentation, we illustrate the conventional training setup with squared loss and $\ell_2$-norm regularization, i.e., $\mathcal{L}(f_{\theta,L}(X), y) = \|f_{\theta,L}(X) - y\|_2^2$ and $\mathcal{R}(\theta) = \sum_{l=1}^L \|W_l\|_F^2$. However, our analysis is valid for arbitrary loss functions and different regularization terms, as proven in the Appendix. Thus, we consider the following optimization problem
$$P^* = \min_{\{\theta_l\}_{l=1}^L} \mathcal{L}(f_{\theta,L}(X), y) + \beta \sum_{l=1}^L \|W_l\|_F^2. \qquad (2)$$
Next, we show that minimum squared $\ell_2$-norm regularization is equivalent to minimum $\ell_1$-norm regularization after a rescaling.

Lemma 1.1. The following problems are equivalent:
$$\min_{\{\theta_l\}_{l=1}^L} \mathcal{L}(f_{\theta,L}(X), y) + \beta \sum_{l=1}^L \|W_l\|_F^2 = \min_{\{\theta_l\}_{l=1}^L,\, t} \mathcal{L}(f_{\theta,L}(X), y) + 2\beta \|w_L\|_1 + \beta(L-2)t^2 \ \ \text{s.t. } w_{L-1,j} \in \mathcal{B}_2,\ \|W_l\|_F \le t,\ \forall l \in [L-2],$$
where $w_{L-1,j}$ denotes the $j$th column of $W_{L-1}$.

Using Lemma 1.1, we first take the dual with respect to the output layer weights $w_L$ and then change the order of min-max to obtain the following dual of the deep network training problem, which provides a lower bound:
$$P^* \ge D^* = \min_t \max_{\lambda} \min_{\substack{w_{L-1,j} \in \mathcal{B}_2,\ \forall j \\ \|W_l\|_F \le t,\ \forall l \in [L-2]}} -\mathcal{L}^*(\lambda) + \beta(L-2)t^2 \quad \text{s.t. } \|A_{L-1}^T \lambda\|_\infty \le 2\beta.$$
To the best of our knowledge, the above dual deep network characterization is novel. Using this result, we first characterize a set of weights that minimize the objective via the optimality conditions and active constraints in the dual objective.
We then prove the optimality of these weights by establishing strong duality, i.e., $P^* = D^*$, for deep networks. We then show that, for deep linear networks with $K$ outputs, the optimal weight matrices are rank-$K$ and align with the previous layers. More importantly, the same analysis and conclusions also apply to deep ReLU networks with $K$ outputs when the input is whitened or rank-one. To the best of our knowledge, this is the first work providing a complete characterization for deep ReLU networks via convex duality. Based on this analysis, we further obtain closed-form solutions for the optimal layer weights. As a corollary, we show that deep ReLU networks fit a linear spline interpolation for one-dimensional datasets. We provide an experiment in Figure 1 to verify this claim. We emphasize that this result was previously known only for two-layer networks (Savarese et al., 2019; Parhi & Nowak, 2019; Ergen & Pilanci, 2020a;b), and here we extend it to arbitrary depth $L$ (see Table 1 for details).
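The rescaling step behind Lemma 1.1 can be checked numerically. The sketch below (toy sizes are hypothetical) verifies that scaling the $j$th column of the last hidden layer by $\alpha_j > 0$ while dividing the $j$th output weight by $\alpha_j$ leaves a ReLU network's output unchanged, and that the balancing choice of $\alpha_j$ collapses the squared Frobenius norms of the last two layers into the $2\|w_L\|_1$-style term:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

n, d, m1, m2 = 8, 5, 6, 4           # toy sizes (hypothetical)
X  = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, m1))
W2 = rng.standard_normal((m1, m2))  # plays the role of W_{L-1} for L = 3
w3 = rng.standard_normal(m2)        # output layer w_L

def forward(W2, w3):
    return relu(relu(X @ W1) @ W2) @ w3

out = forward(W2, w3)

# Rescale: column j of W_{L-1} by alpha_j, entry j of w_L by 1/alpha_j.
alpha = np.sqrt(np.abs(w3) / np.linalg.norm(W2, axis=0))  # balancing choice
W2b, w3b = W2 * alpha, w3 / alpha

# positive homogeneity of ReLU => the network output is unchanged
assert np.allclose(forward(W2b, w3b), out)

# after balancing, the two squared norms equal 2 * sum_j |w_{L,j}| ||w_{L-1,j}||_2
cost = np.linalg.norm(W2b, 'fro')**2 + np.linalg.norm(w3b, 2)**2
assert np.isclose(cost, 2 * np.sum(np.abs(w3) * np.linalg.norm(W2, axis=0)))
```

With the columns additionally normalized to unit norm, the balanced cost is exactly $2\|w_L\|_1$, matching the lemma.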

2. WARMUP: TWO-LAYER LINEAR NETWORKS

To illustrate an application of the convex dual $D^*$, we consider the simple case of two-layer linear networks with output $f_{\theta,2}(X) = XW_1 w_2$ and define the parameter space as $\theta \in \Theta = \{(W_1, w_2, m) \mid W_1 \in \mathbb{R}^{d \times m},\ w_2 \in \mathbb{R}^m,\ m \in \mathbb{Z}_+\}$. Motivated by recent results (Neyshabur et al., 2014; Chizat & Bach, 2018; Savarese et al., 2019; Parhi & Nowak, 2019; Ergen & Pilanci, 2020a;b), we first focus on a minimum norm variant of equation 1 (corresponding to weak regularization, i.e., $\beta \to 0$ in equation 1; see e.g. Wei et al. (2018)) with $\mathcal{L}(f_{\theta,L}(X), y) = \|f_{\theta,L}(X) - y\|_2^2$, and then extend the analysis to equation 1. The minimum norm primal training problem can be written as
$$\min_{\theta \in \Theta} \|W_1\|_F^2 + \|w_2\|_2^2 \quad \text{s.t. } f_{\theta,2}(X) = y. \qquad (3)$$
Using Lemma A.1 (all equivalence lemmas and proofs are presented in Appendix A.3), we equivalently have
$$P^* = \min_{\theta \in \Theta} \|w_2\|_1 \quad \text{s.t. } f_{\theta,2}(X) = y,\ w_{1,j} \in \mathcal{B}_2,\ \forall j, \qquad (4)$$
which has the following dual form.

Theorem 2.1. The dual of the problem in equation 4 is given by
$$P^* \ge D^* = \max_{\lambda \in \mathbb{R}^n} \lambda^T y \quad \text{s.t. } \max_{w_1 \in \mathcal{B}_2} |\lambda^T X w_1| \le 1. \qquad (5)$$
For finite width networks, there exists a finite $m$ such that strong duality holds, i.e., $P^* = D^*$, and an optimal $W_1$ for equation 4 satisfies $\|(XW_1^*)^T \lambda^*\|_\infty = 1$, where $\lambda^*$ is the optimal dual parameter.

Using Theorem 2.1, we now characterize the optimal neurons as the extreme points of a convex set.

Corollary 2.1. Theorem 2.1 implies that the optimal neurons are extreme points which solve the following problem: $\arg\max_{w_1 \in \mathcal{B}_2} |\lambda^{*T} X w_1|$.

Definition 1. We call the maximizers of the constraint in Corollary 2.1 extreme points.

From Theorem 2.1, we have the following dual problem
$$\max_{\lambda} \lambda^T y \quad \text{s.t. } \max_{w_1 \in \mathcal{B}_2} |\lambda^T X w_1| \le 1. \qquad (6)$$
Let $X = U_x \Sigma_x V_x^T$ be the singular value decomposition (SVD) of $X$. If we assume that there exists $w^*$ such that $Xw^* = y$, which is without loss of generality by Proposition 2.1, then equation 6 is equivalent to
$$\max_{\tilde{\lambda}} \tilde{\lambda}^T \Sigma_x \tilde{w}^* \quad \text{s.t. } \|\Sigma_x^T \tilde{\lambda}\|_2 \le 1, \qquad (7)$$
where $\tilde{\lambda} = U_x^T \lambda$ and $\tilde{w}^* = V_x^T w^*$. Notice that in equation 7, we use an alternative formulation for the constraint, i.e., $\|X^T \lambda\|_2 \le 1$ instead of $|\lambda^T X w_1| \le 1, \forall w_1 \in \mathcal{B}_2$, since the extreme point is achieved at $w_1 = X^T \lambda / \|X^T \lambda\|_2$.
Given $\operatorname{rank}(X) = r \le \min\{n, d\}$, we have
$$\tilde{\lambda}^T \Sigma_x \tilde{w}^* = \tilde{\lambda}^T \Sigma_x \tilde{w}_r^* \le \|\Sigma_x^T \tilde{\lambda}\|_2 \|\tilde{w}_r^*\|_2 \le \|\tilde{w}_r^*\|_2,$$
where $\tilde{w}_r^* := \begin{bmatrix} I_r & 0_{r \times (d-r)} \\ 0_{(d-r) \times r} & 0_{(d-r) \times (d-r)} \end{bmatrix} \tilde{w}^*$, which shows that the maximum objective value is achieved when $\Sigma_x^T \tilde{\lambda} = c_1 \tilde{w}_r^*$ for some scaling $c_1 > 0$. Thus, we have
$$w_1^* = \frac{V_x \Sigma_x^T \tilde{\lambda}}{\|V_x \Sigma_x^T \tilde{\lambda}\|_2} = \frac{V_x \tilde{w}_r^*}{\|\tilde{w}_r^*\|_2} = \frac{P_{X^T}(w^*)}{\|P_{X^T}(w^*)\|_2}, \qquad (8)$$
where $P_{X^T}(\cdot)$ projects its input onto the range of $X^T$. In the following results, we show that one can consider a planted model without loss of generality, and we prove strong duality for equation 4.

Proposition 2.1. [Du & Hu (2019)] Given $w^* = \arg\min_w \|Xw - y\|_2$, we have $\arg\min_{W_1, w_2} \|XW_1 w_2 - Xw^*\|_2^2 = \arg\min_{W_1, w_2} \|XW_1 w_2 - y\|_2^2$.

Theorem 2.2. Let $\{X, y\}$ be feasible for equation 4; then strong duality holds for finite width networks.
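The extreme point $w_1 = X^T\lambda / \|X^T\lambda\|_2$ is a direct consequence of the Cauchy-Schwarz inequality, which the short check below confirms numerically (sizes are hypothetical): no other unit-norm direction attains a larger value of $|\lambda^T X w_1|$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 6
X = rng.standard_normal((n, d))
lam = rng.standard_normal(n)

g = X.T @ lam
w_star = g / np.linalg.norm(g)       # claimed extreme point
best = abs(lam @ X @ w_star)         # equals ||X^T lam||_2 by Cauchy-Schwarz

assert np.isclose(best, np.linalg.norm(g))
# no random unit vector does better
for _ in range(1000):
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    assert abs(lam @ X @ w) <= best + 1e-9
```

This is why the constraint $\max_{w_1 \in \mathcal{B}_2} |\lambda^T X w_1| \le 1$ can be rewritten as $\|X^T \lambda\|_2 \le 1$ in equation 7.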

2.1. REGULARIZED TRAINING PROBLEM

In this section, we define the regularized version of equation 4 as
$$\min_{\theta \in \Theta} \frac{1}{2} \|f_{\theta,2}(X) - y\|_2^2 + \beta \|w_2\|_1 \quad \text{s.t. } w_{1,j} \in \mathcal{B}_2,\ \forall j, \qquad (9)$$
which has the following dual form
$$\max_{\lambda} -\frac{1}{2} \|\lambda - y\|_2^2 + \frac{1}{2} \|y\|_2^2 \quad \text{s.t. } \max_{w_1 \in \mathcal{B}_2} |\lambda^T X w_1| \le \beta.$$
Then, an optimal neuron needs to satisfy the condition
$$w_1^* = \frac{X^T P_{X,\beta}(y)}{\|X^T P_{X,\beta}(y)\|_2},$$
where $P_{X,\beta}(\cdot)$ projects its argument onto $\{u \in \mathbb{R}^n \mid \|X^T u\|_2 \le \beta\}$. We now prove strong duality for equation 9.

Theorem 2.3. Strong duality holds for equation 9 with finite width networks.
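A minimal sketch of the easy regime of this projection, assuming $\beta$ is chosen large enough that $y$ itself satisfies $\|X^T y\|_2 \le \beta$: then $y$ is already in the constraint set, so $P_{X,\beta}(y) = y$ and the optimal-neuron condition reduces to $w_1^* = X^T y / \|X^T y\|_2$ (the threshold choice below is hypothetical).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

beta = 1.1 * np.linalg.norm(X.T @ y)    # beta large enough: y is feasible
# y lies in {u : ||X^T u||_2 <= beta}, so projecting y onto that set returns y
assert np.linalg.norm(X.T @ y) <= beta

w1 = X.T @ y / np.linalg.norm(X.T @ y)  # optimal-neuron direction in this regime
assert np.isclose(np.linalg.norm(w1), 1.0)
```

For smaller $\beta$ the constraint becomes active and the projection moves $y$, which is what shrinks and eventually prunes neurons as $\beta$ grows.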

2.2. TRAINING PROBLEM WITH VECTOR OUTPUTS

Here, the model is $f_{\theta,2}(X) = XW_1 W_2$ with targets $Y \in \mathbb{R}^{n \times K}$, which can be optimized as follows
$$\min_{\theta \in \Theta} \|W_1\|_F^2 + \|W_2\|_F^2 \quad \text{s.t. } f_{\theta,2}(X) = Y. \qquad (10)$$
Using Lemma A.2, we reformulate equation 10 as
$$\min_{\theta \in \Theta} \sum_{j=1}^m \|w_{2,j}\|_2 \quad \text{s.t. } f_{\theta,2}(X) = Y,\ w_{1,j} \in \mathcal{B}_2,\ \forall j, \qquad (11)$$
which has the following dual with respect to $W_2$
$$\max_{\Lambda} \operatorname{trace}(\Lambda^T Y) \quad \text{s.t. } \|\Lambda^T X w_1\|_2 \le 1,\ \forall w_1 \in \mathcal{B}_2. \qquad (12)$$
Since we can assume $Y = XW^*$ due to Proposition 2.1, where $W^* \in \mathbb{R}^{d \times K}$, we have
$$\operatorname{trace}(\Lambda^T Y) = \operatorname{trace}(\Lambda^T X W^*) = \operatorname{trace}(\Lambda^T U_x \Sigma_x \tilde{W}_r^*) \le \sigma_{\max}(\Lambda^T U_x \Sigma_x) \|\tilde{W}_r^*\|_* \le \|\tilde{W}_r^*\|_*,$$
where $\sigma_{\max}(\Lambda^T X) \le 1$ due to equation 12 and $\tilde{W}_r^* = \begin{bmatrix} I_r & 0_{r \times (d-r)} \\ 0_{(d-r) \times r} & 0_{(d-r) \times (d-r)} \end{bmatrix} V_x^T W^*$. Given the SVD of $\tilde{W}_r^*$, i.e., $U_w \Sigma_w V_w^T$, choosing
$$\Lambda^T U_x \Sigma_x = V_w \begin{bmatrix} I_{r_w} & 0_{r_w \times (d - r_w)} \\ 0_{(K - r_w) \times r_w} & 0_{(K - r_w) \times (d - r_w)} \end{bmatrix} U_w^T$$
achieves the upper bound above, where $r_w = \operatorname{rank}(\tilde{W}_r^*)$. Thus, the optimal neurons are a subset of the first $r_w$ right singular vectors of $\Lambda^T X$. Moreover, the next result shows that strong duality holds.

Theorem 2.4. Let $\{X, Y\}$ be feasible for equation 11; then strong duality holds for finite width networks.
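The key inequality in the chain above is a matrix Hölder inequality for Schatten norms, $\operatorname{trace}(A^T B) \le \sigma_{\max}(A)\|B\|_*$, which a quick numeric check confirms (sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, K = 9, 6, 3
X   = rng.standard_normal((n, d))
Lam = rng.standard_normal((n, K))
W   = rng.standard_normal((d, K))

lhs = np.trace(Lam.T @ X @ W)
# trace(Lam^T X W) = <X^T Lam, W>, so Hölder gives sigma_max(X^T Lam) * ||W||_*
A = X.T @ Lam
rhs = np.linalg.norm(A, 2) * np.linalg.norm(W, 'nuc')  # operator * nuclear norm
assert lhs <= rhs + 1e-9
```

Equality requires the singular directions of $\Lambda^T X$ and $W^*$ to align, which is exactly how the dual-optimal $\Lambda$ is constructed above.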

2.2.1. REGULARIZED CASE

Here, we define the regularized version of equation 11 as follows
$$\min_{\theta \in \Theta} \frac{1}{2} \|f_{\theta,2}(X) - Y\|_F^2 + \beta \sum_{j=1}^m \|w_{2,j}\|_2 \quad \text{s.t. } w_{1,j} \in \mathcal{B}_2,\ \forall j, \qquad (13)$$
which has the following dual with respect to $W_2$
$$\max_{\Lambda} -\frac{1}{2} \|\Lambda - Y\|_F^2 + \frac{1}{2} \|Y\|_F^2 \quad \text{s.t. } \sigma_{\max}(\Lambda^T X) \le \beta.$$
Then, the optimal neurons are a subset of the maximal right singular vectors of $P_{X,\beta}(Y)^T X$, where $P_{X,\beta}(\cdot)$ projects its input onto the set $\{U \in \mathbb{R}^{n \times K} \mid \sigma_{\max}(U^T X) \le \beta\}$.

Remark 2.1. Note that the optimal neurons are the right singular vectors of $P_{X,\beta}(Y)^T X$ that achieve the upper bound of the set, i.e., $\|P_{X,\beta}(Y)^T X w_1^*\|_2 = \beta$ with $\|w_1^*\|_2 = 1$. This implies that the optimal neurons satisfy $\|Y^T X w_1^*\|_2 \ge \beta$. Therefore, the number of optimal neurons, and hence the rank of the optimal weight matrix $W_1^*$, is determined by $\beta$.

Remark 2.2. There might exist optimal solutions other than the right singular vectors of $P_{X,\beta}(Y)^T X$. As an example, consider two optimal right singular vectors $u_1$ and $u_2$. Then any $u = \alpha_1 u_1 + \alpha_2 u_2$ with $\alpha_1^2 + \alpha_2^2 = 1$ also achieves the upper bound and is therefore optimal.
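A rough numeric illustration of Remark 2.1, using the singular values of $Y^T X$ as a proxy and ignoring the projection step (an assumption made here for simplicity): counting the singular values at least $\beta$ predicts the rank of the optimal first layer, and this count is non-increasing in $\beta$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, K = 12, 8, 4
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, K))

svals = np.linalg.svd(Y.T @ X, compute_uv=False)
predicted_rank = lambda beta: int(np.sum(svals >= beta))

# the predicted rank can only drop as beta grows,
# and it changes exactly when beta crosses a singular value
betas = np.linspace(0.0, 1.1 * svals.max(), 50)
ranks = [predicted_rank(b) for b in betas]
assert all(r1 >= r2 for r1, r2 in zip(ranks, ranks[1:]))
assert predicted_rank(0.0) == min(K, d)   # tiny beta keeps all directions
```

This matches the staircase behavior observed in Figure 2a, where the rank of $W_1$ changes as $\beta$ crosses each singular value.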

3. DEEP LINEAR NETWORKS

We now consider an $L$-layer linear network with $f_{\theta,L}(X) = XW_1 \cdots W_{L-1} w_L$, and the training problem
$$P^* = \min_{\{\theta_l\}_{l=1}^L} \sum_{l=1}^L \|W_l\|_F^2 \quad \text{s.t. } f_{\theta,L}(X) = y. \qquad (14)$$

Proposition 3.1. The first $L-2$ hidden layer weight matrices in equation 14 have the same operator and Frobenius norms, i.e., $t_1 = t_2 = \ldots = t_{L-2}$, where $t_l = \|W_l\|_F = \|W_l\|_2$, $\forall l \in [L-2]$.

Theorem 3.1. Optimal layer weights for equation 14 satisfy the following relation
$$W_l^* = \begin{cases} t^* V_x \frac{\tilde{w}_r^*}{\|\tilde{w}_r^*\|_2} \rho_1^T & \text{if } l = 1 \\ t^* \rho_{l-1} \rho_l^T & \text{if } 1 < l \le L-2 \\ \rho_{L-2} & \text{if } l = L-1 \end{cases},$$
where $\|\rho_l\|_2 = 1$, $\forall l \in [L-2]$, and $\tilde{w}_r^*$ follows the definition in equation 8. This result clearly shows that the layer weights need to satisfy an alignment condition. The next theorem shows that strong duality holds in this case.

Theorem 3.2. Let $\{X, y\}$ be feasible for equation 14; then strong duality holds for finite width networks.

Corollary 3.1. Theorem 3.1 implies that deep linear networks can obtain a scaled version of $y$ using only the first layer, i.e., $XW_1 \rho_1 = c y$ for some $c > 0$. Therefore, the remaining layers do not contribute to the expressive power.
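As a sanity check on the alignment in Theorem 3.1, the sketch below (hypothetical sizes, $L = 4$; we plant feasible labels $y = Xw^*$ with full-column-rank $X$, so $w^*$ already lies in the range of $X^T$ and $\tilde{w}_r^*$ reduces to $w^*$ in rotated coordinates) constructs the aligned rank-one weights and verifies that the end-to-end linear map is a positive multiple of $y$, so a single output scalar $w_L$ fits the constraint exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, L, m = 8, 5, 4, 6            # m: common hidden width (hypothetical)
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                      # feasible planted labels

t = 1.3                             # any positive scale t*
u = w_star / np.linalg.norm(w_star) # direction of w* (already in range(X^T))
rhos = [rng.standard_normal(m) for _ in range(L - 2)]
rhos = [r / np.linalg.norm(r) for r in rhos]

W = [t * np.outer(u, rhos[0])]                                    # W_1
W += [t * np.outer(rhos[l - 1], rhos[l]) for l in range(1, L - 2)]  # middle layers
W += [rhos[L - 3][:, None]]                                       # W_{L-1}

prod = X.copy()
for Wl in W:
    prod = prod @ Wl
prod = prod.ravel()

# the end-to-end map is a positive multiple of y, so a scalar w_L fits exactly
c = (prod @ y) / (prod @ prod)
assert np.allclose(c * prod, y)
# every hidden weight matrix in the chain is rank-one, as the theorem predicts
assert all(np.linalg.matrix_rank(Wl) == 1 for Wl in W[:-1])
```

The unit-norm $\rho_l$ vectors cancel telescopically in the product, which is the mechanism behind Corollary 3.1.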

3.1. TRAINING PROBLEM WITH VECTOR OUTPUTS

Here, we consider vector output deep networks, i.e., $m_L = K$, with output $f_{\theta,L}(X) = XW_1 \cdots W_L$. In this case, we have the following training problem
$$\min_{\{\theta_l\}_{l=1}^L} \sum_{l=1}^L \|W_l\|_F^2 \quad \text{s.t. } f_{\theta,L}(X) = Y. \qquad (15)$$
With the same approach, the optimal layer weights for equation 15 can be characterized as follows. (Since the derivations are similar, we present the details in Appendix A.4 and A.6.)

Theorem 3.3. Optimal layer weights for equation 15 can be formulated as follows
$$W_l^* = \begin{cases} t^* \sum_{j=1}^K \tilde{v}_{w,j} \rho_{1,j}^T & \text{if } l = 1 \\ t^* \sum_{j=1}^K \rho_{l-1,j} \rho_{l,j}^T & \text{if } 1 < l \le L-2 \\ \sum_{j=1}^K \rho_{L-2,j} & \text{if } l = L-1 \end{cases},$$
where $\tilde{v}_{w,j}$ is the $j$th maximal right singular vector of $\Lambda^T X$, and we may pick a set of unit norm vectors $\{\rho_{l,j}\}_{l=1}^{L-2}$ such that $\rho_{l,j}^T \rho_{l,k} = 0$, $\forall j \ne k$. The next theorem formally proves that strong duality holds for the primal problem in equation 15.

Theorem 3.4. Let $\{X, Y\}$ be feasible for equation 15; then strong duality holds for finite width networks.

4. DEEP RELU NETWORKS

Here, we consider an $L$-layer ReLU network with $f_{\theta,L}(X) = A_{L-1} w_L$, where $A_l = (A_{l-1} W_l)_+$, $\forall l \in [L-1]$, $A_0 = X$, and $(x)_+ = \max\{0, x\}$. Below, we first state the training problem and then present our results:
$$\min_{\{\theta_l\}_{l=1}^L} \sum_{l=1}^L \|W_l\|_F^2 \quad \text{s.t. } f_{\theta,L}(X) = y. \qquad (16)$$

Theorem 4.1. Let $X$ be a rank-one data matrix such that $X = c a_0^T$, where $c \in \mathbb{R}_+^n$ and $a_0 \in \mathbb{R}^d$. Then strong duality holds and the optimal weights for each layer can be formulated as follows
$$W_l = \frac{\phi_{l-1}}{\|\phi_{l-1}\|_2} \phi_l^T,\ \forall l \in [L-2], \qquad w_{L-1} = \frac{\phi_{L-2}}{\|\phi_{L-2}\|_2},$$
where $\phi_0 = a_0$ and $\{\phi_l\}_{l=1}^{L-2}$ is a set of vectors such that $\phi_l \in \mathbb{R}_+^{m_l}$ and $\|\phi_l\|_2 = t^*$, $\forall l \in [L-2]$.

Our derivations can also be extended to cases with a bias term. Below, we first examine a two-layer ReLU network training problem with a bias term and then extend this result to a multi-layer network.

Theorem 4.2. Let $X$ be a data matrix such that $X = c a_0^T$, where $c \in \mathbb{R}^n$ and $a_0 \in \mathbb{R}^d$. Then, a set of optimal solutions to equation 16 is given by $\{(w_i, b_i)\}_{i=1}^m$, where
$$w_i = s_i \frac{a_0}{\|a_0\|_2}, \qquad b_i = -s_i c_i \|a_0\|_2, \qquad \text{with } s_i = \pm 1,\ \forall i \in [m].$$

Corollary 4.1. As a result of Theorem 4.2, when we have one-dimensional data, i.e., $x \in \mathbb{R}^n$, an optimal solution to equation 16 can be formulated as $\{(w_i, b_i)\}_{i=1}^m$, where $w_i = s_i$, $b_i = -s_i x_i$ with $s_i = \pm 1$, $\forall i \in [m]$. Therefore, the optimal network output has kinks only at the input data points, i.e., the output function is a sum of terms of the form $(\pm(x - x_i))_+$, a piecewise linear function with breakpoints at the data points. Therefore, the network output is a linear spline interpolation for one-dimensional datasets.

We now extend the results in Theorem 4.2 and Corollary 4.1 to multi-layer ReLU networks.

Proposition 4.1. Theorem 4.1 still holds when we add a bias term to the last hidden layer, i.e., when the output satisfies $\left(A_{L-2} W_{L-1} + 1_n b^T\right)_+ w_L = y$, where $A_l = (A_{l-1} W_l)_+$, $\forall l \in [L-2]$.

Corollary 4.2.
As a result of Theorem 4.2 and Proposition 4.1, when we have one-dimensional data, i.e., $x \in \mathbb{R}^n$, the optimal network output has kinks only at the input data points, i.e., $f_{\theta,L}(x)$ is a sum of terms of the form $(\pm(x - x_i))_+$, a piecewise linear function with breakpoints at the data points. Therefore, the network output is a linear spline interpolation for one-dimensional datasets.

Remark 4.1. Note that in Corollaries 4.1 and 4.2, we prove that the optimal output functions of multi-layer networks are linear spline interpolators for rank-one data, which generalizes the two-layer results for one-dimensional data in Savarese et al. (2019); Parhi & Nowak (2019); Ergen & Pilanci (2020a;b) to arbitrary depth. We also remark that the analysis of ReLU networks for the one-dimensional data considered in these works is non-trivial, and it is a special case of our rank-one data assumption.

The analysis in Theorem 4.1 also holds for vector output multi-layer ReLU networks, as shown in the next result.

Proposition 4.2. Strong duality also holds for deep ReLU networks with vector outputs, and the optimal layer weights can be formulated as in Theorem 4.1.

Now, we extend our characterization to whitened data matrices of arbitrary rank and fully characterize the optimal layer weights of a deep ReLU network with $K$ outputs.

Theorem 4.3. Let $\{X, Y\}$ be a dataset such that $XX^T = I_n$ and $Y$ has orthogonal columns. Then the optimal weight matrices for each layer can be formulated as follows
$$W_l = \frac{1}{\sqrt{2K}} \sum_{r=1}^{2K} \frac{\phi_{l-1,r}}{\|\phi_{l-1,r}\|_2} \phi_{l,r}^T,\ \forall l \in [L-2], \qquad W_{L-1} = \frac{1}{\sqrt{2K}} \begin{bmatrix} \frac{\phi_{L-2,1}}{\|\phi_{L-2,1}\|_2} & \cdots & \frac{\phi_{L-2,2K}}{\|\phi_{L-2,2K}\|_2} \end{bmatrix},$$
where $(\phi_{0,2j-1}, \phi_{0,2j}) = \left( \left(X^T y_j\right)_+, \left(X^T (-y_j)\right)_+ \right)$, $\forall j \in [K]$, and $\{\phi_{l,r}\}_{l=1}^{L-2}$ is a set of vectors such that $\phi_{l,r} \in \mathbb{R}_+^{m_l}$, $\|\phi_{l,r}\|_2 = t^*$, and $\phi_{l,i}^T \phi_{l,j} = 0$, $\forall i \ne j$.

Remark 4.2. In one-hot encoded labeling, which is the conventional labeling for classification tasks, the label matrix $Y \in \mathbb{R}^{n \times K}$ has nonoverlapping, and therefore orthogonal, columns.
Hence, classification tasks with one-hot encoded labels directly satisfy the assumption in Theorem 4.3.

Remark 4.3. We note that the whitening assumption $XX^T = I_n$ necessitates $n \le d$, which might appear restrictive. However, this case is common in few-shot classification problems with limited labels (Chen et al., 2018). Moreover, it is challenging to obtain reliable labels in problems involving high dimensional data, such as medical imaging (Hyun et al., 2020) and genetics (Singh & Yamada, 2020), where $n \le d$ is typical. More importantly, SGD as employed in deep learning frameworks, e.g., PyTorch and TensorFlow, operates on minibatches rather than the full dataset. Therefore, even when $n > d$, each gradient descent update is evaluated only on small batches, where the batch size $n_b$ satisfies $n_b \ll d$. Hence, the $n \le d$ case implicitly occurs during the training phase.

We note that these results also hold for regularized ReLU networks as in the previous sections, and we can obtain closed-form solutions for all the layer weights, as proven in the next result.

Theorem 4.4. Let $\{X, Y\}$ be a dataset such that $XX^T = I_n$ and $Y$ has orthogonal columns. Then a set of optimal layer weight matrices for the following regularized training problem
$$\min_{\theta \in \Theta} \frac{1}{2} \|f_{\theta,L}(X) - Y\|_F^2 + \frac{\beta}{2} \sum_{l=1}^L \|W_l\|_F^2 \qquad (17)$$
can be formulated as follows
$$W_l = \begin{cases} \sum_{r=1}^{2K} \frac{\phi_{l-1,r}}{\|\phi_{l-1,r}\|_2} \phi_{l,r}^T & \text{if } 1 \le l \le L-1 \\ \sum_{r=1}^{2K} \left( \|\phi_{0,r}\|_2 - \beta \right) \phi_{l-1,r} \hat{e}_r^T & \text{if } l = L \end{cases},$$
where $\hat{e}_{2j-1} = \hat{e}_{2j} = e_j$, $\forall j \in [K]$, $e_j$ is the $j$th standard basis vector, and the other definitions follow from Theorem 4.3, except that $t^* = 1$.

Remark 4.4. Theorem 4.4 proves that when the data matrix is whitened and the label matrix satisfies certain conditions, all the layer weights can be obtained in closed form. We further note that the conditions in this theorem are common in some generic regression/classification frameworks.
As an example, for image classification tasks, it has been shown that whitening significantly improves the classification accuracy of state-of-the-art architectures, e.g., ResNets, on benchmark datasets such as CIFAR-100 and ImageNet (Huang et al., 2018). Furthermore, since the label matrix is one-hot encoded in image classification tasks, it directly satisfies the condition in the theorem. Therefore, in such cases, there is no need to train a deep ReLU network in an end-to-end manner; instead, one can directly use the closed-form formulas in Theorem 4.4.
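The spline claim of Corollaries 4.1 and 4.2 can be illustrated with a small numeric sketch. We place one ReLU kink at each data point plus a constant offset (an illustrative parameterization standing in for the network's bias terms, not the paper's exact construction) and verify that the resulting function interpolates the data and is exactly linear between consecutive data points:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(-1, 1, 7))   # 1D data points (toy example)
y = rng.standard_normal(7)

relu = lambda z: np.maximum(z, 0)
n = len(x)
# Basis: a constant offset plus one ReLU kink per data point (except the last):
# f(t) = b + sum_j alpha_j * relu(t - x_j), piecewise linear with kinks at x_j.
A = np.column_stack([np.ones(n)] + [relu(x - x[j]) for j in range(n - 1)])
coef = np.linalg.solve(A, y)          # lower-triangular, nonsingular system

f = lambda t: coef[0] + sum(coef[1 + j] * relu(t - x[j]) for j in range(n - 1))

# exact interpolation at the data points
assert np.allclose([f(t) for t in x], y)
# between consecutive data points the fit is exactly linear
# (the midpoint value equals the average of the endpoint values)
mids = 0.5 * (x[:-1] + x[1:])
for a, mm, b in zip(x[:-1], mids, x[1:]):
    assert np.isclose(f(mm), 0.5 * (f(a) + f(b)))
```

Because every kink sits on a data point, the fitted function is precisely the linear spline interpolator, matching Figure 1.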

5. NUMERICAL EXPERIMENTS

Here, we present numerical results to verify our theoretical analysis. We first use synthetic datasets generated from a random data matrix with zero mean and identity covariance, where the corresponding output vector is obtained via a randomly initialized teacher network. We first consider a two-layer linear network with $W_1 \in \mathbb{R}^{20 \times 50}$ and $W_2 \in \mathbb{R}^{50 \times 5}$. To verify our claim in Remark 2.1, we train the network using GD with different values of $\beta$. In Figure 2a, we plot the rank of $W_1$ as a function of $\beta$, as well as the locations of the singular values of $\hat{W}^* \Sigma_x V_x^T$ using vertical red lines. This shows that the rank of the layer changes when $\beta$ equals one of the singular values, which verifies Remark 2.1. We also consider a four-layer linear network with $W_1 \in \mathbb{R}^{5 \times 50}$, $W_2 \in \mathbb{R}^{50 \times 30}$, $W_3 \in \mathbb{R}^{30 \times 40}$, and $W_4 \in \mathbb{R}^{40 \times 5}$. We then select regularization parameters $\beta_1 < \beta_2 < \beta_3 < \beta_4$. As illustrated in Figure 2b, $\beta$ determines the rank of each weight matrix, and the rank is the same for all layers, which matches our results. Moreover, to verify Proposition 3.1, we choose $\beta$ such that the weights are rank-two. In Figure 3a, we numerically show that all the hidden layer weight matrices have the same operator and Frobenius norms. We also perform an experiment for a five-layer ReLU network with $W_1 \in \mathbb{R}^{10 \times 50}$, $W_2 \in \mathbb{R}^{50 \times 40}$, $W_3 \in \mathbb{R}^{40 \times 30}$, $W_4 \in \mathbb{R}^{30 \times 20}$, and $w_5 \in \mathbb{R}^{20 \times 1}$. Here, we use data such that $X = c a_0^T$, where $c \in \mathbb{R}_+^n$ and $a_0 \in \mathbb{R}^d$. In Figure 3b, we plot the rank of each weight matrix, which converges to one as claimed in Proposition 4.1. We also verify our theory on two real benchmark datasets, i.e., MNIST (LeCun) and CIFAR-10 (Krizhevsky et al., 2014). We first randomly undersample and whiten these datasets, and we convert the labels into one-hot encoded form. Then, we consider a ten-class classification/regression task using three multi-layer ReLU network architectures with $L = 3, 4, 5$.
For each architecture, we use SGD with momentum for training and compare the training/test performance with the corresponding network constructed via the closed-form solutions in Theorem 4.3 (without any training), denoted as "Theory". In Figure 4, we observe that Theory achieves the optimal training objective and also yields smaller error and higher accuracy in the test phase. Hence, these experiments numerically verify our claims in Theorem 4.3.
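The mechanism that lets the "Theory" construction work without training is the pairing of $\pm$ ReLU neurons under whitening: since $(u)_+ - (-u)_+ = u$ and $XX^T = I_n$, the paired first-layer weights $(X^T y_j)_+$ and $(-X^T y_j)_+$ jointly recover each label column exactly. A small check (sizes hypothetical; the whitened $X$ is built via a QR factorization):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 5, 9                              # whitening XX^T = I_n requires n <= d
M = rng.standard_normal((d, n))
Q, _ = np.linalg.qr(M)                   # d x n, orthonormal columns
X = Q.T                                  # rows of X are orthonormal: XX^T = I_n
assert np.allclose(X @ X.T, np.eye(n))

relu = lambda z: np.maximum(z, 0)
y = (rng.uniform(0, 1, n) > 0.5).astype(float)   # a 0/1 label column

phi_pos, phi_neg = relu(X.T @ y), relu(-(X.T @ y))
# (u)_+ - (-u)_+ = u, so the paired neurons recover X X^T y = y exactly
assert np.allclose(X @ (phi_pos - phi_neg), y)
```

With one-hot labels, the orthogonal columns $y_j$ keep these pairs decoupled across classes, which is why all $2K$ neurons can be written down in closed form.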

6. CONCLUDING REMARKS

We studied regularized DNN training problems and developed an analytic framework to characterize a set of optimal solutions. We showed that optimal layer weights can be explicitly formulated as the extreme points of a convex set via the dual problem. We then proved that strong duality holds for both deep linear and ReLU networks and provided a set of optimal solutions. We also extended our derivations to vector outputs and many other loss functions. More importantly, our analysis shows that when the input data is whitened or rank-one, instead of training an L-layer deep ReLU network in an end-to-end manner, one can directly use the closed-form solutions provided in Theorems 4.1, 4.3, and 4.4. As another corollary, we proved that the kinks of the ReLU activations occur exactly at the input data points, so that the optimized network outputs linear spline interpolations for one-dimensional datasets, which was previously known only for two-layer networks (Savarese et al., 2019; Parhi & Nowak, 2019; Ergen & Pilanci, 2020a;b). We conjecture that our extreme point characterization can also be extended to reveal the structure of the optimal weights for arbitrary data, which could help explain the extraordinary generalization properties of DNNs.

Figure 4: Training and test performance on whitened and subsampled datasets, where $(n, d) = (60, 90)$, $K = 10$, and $L = 3, 4, 5$ with 50 neurons per layer, using squared loss with one-hot encoding (panel (f): CIFAR-10 test accuracy). For "Theory", we use the layer weights in Theorem 4.3, which achieve the optimal performance as guaranteed by Theorem 4.3.

A APPENDIX

Here, we present additional materials and proofs of the main results that are not included in the main paper due to the page limit. We also restate each result before the corresponding proof for the convenience of the reader.

A.1 GENERAL LOSS FUNCTIONS

In this section, we show that our extreme point characterization holds for arbitrary convex loss functions, including cross entropy and hinge loss. Consider
$$\min_{\theta \in \Theta} \mathcal{L}(f_{\theta,2}(X), y) + \beta \|w_2\|_1 \quad \text{s.t. } w_{1,j} \in \mathcal{B}_2,\ \forall j, \qquad (18)$$
where $\mathcal{L}(\cdot, y)$ is a convex loss function.

Theorem A.1. The dual of equation 18 is given by
$$\max_{\lambda} -\mathcal{L}^*(\lambda) \quad \text{s.t. } \|X^T \lambda\|_2 \le \beta,$$
where $\mathcal{L}^*$ is the Fenchel conjugate function, defined as $\mathcal{L}^*(\lambda) = \max_z z^T \lambda - \mathcal{L}(z, y)$.

Theorem A.1 proves that our extreme point characterization in Corollary 2.1 applies to arbitrary loss functions. Therefore, the optimal parameters for equation 3 and equation 9 are a subset of the same extreme point set, i.e., determined by the input data matrix $X$, independent of the loss function.

Remark A.1. Since our characterization is generic in the sense that it holds for vector output, deep linear, and deep ReLU networks (see the main paper for details), Theorem A.1 is valid for all of our derivations.
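As a concrete instance of the Fenchel conjugate used above, for the squared loss with the $\tfrac{1}{2}$ scaling, $\mathcal{L}(z, y) = \tfrac{1}{2}\|z - y\|_2^2$, the conjugate is $\mathcal{L}^*(\lambda) = \lambda^T y + \tfrac{1}{2}\|\lambda\|_2^2$ (the inner maximizer is $z^* = y + \lambda$). A quick numeric check of this closed form against the definition:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 6
y   = rng.standard_normal(n)
lam = rng.standard_normal(n)

L = lambda z: 0.5 * np.sum((z - y) ** 2)     # squared loss, 1/2 scaling
conj_closed = lam @ y + 0.5 * lam @ lam      # L*(lam) = lam^T y + ||lam||^2 / 2

z_star = y + lam                             # stationary point of z^T lam - L(z)
assert np.isclose(z_star @ lam - L(z_star), conj_closed)
# no perturbed point does better: the inner problem is concave in z
for _ in range(200):
    z = z_star + rng.standard_normal(n)
    assert z @ lam - L(z) <= conj_closed + 1e-9
```

Plugging this conjugate into Theorem A.1 recovers exactly the dual of the regularized problem in equation 9.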

A.2 ADDITIONAL NUMERICAL RESULTS

Here, we present numerical results that are not included in the main paper due to the page limit. In Figure 5a, we perform an experiment to check whether the hidden neurons of a two-layer linear network align with the proposed right singular vectors. For this experiment, we select $\beta$ such that $W_1$ becomes rank-two. After training, we first normalize each neuron to have unit norm, i.e., $\|w_{1,j}\|_2 = 1$, $\forall j$, and then compute the sum of the projections of each neuron onto each right singular vector, denoted $v_i$. Since we choose $\beta$ such that $W_1$ is a rank-two matrix, most of the neurons align with the first two right singular vectors, as expected. Therefore, this experiment verifies our analysis and the claims in Remark 2.1. Furthermore, as an alternative to Figure 2a, we plot the singular values of $W_1$ with respect to the regularization parameter $\beta$ in Figure 5b.

A.3 EQUIVALENCE (RESCALING) LEMMAS FOR THE NON-CONVEX OBJECTIVES

In this section, we present all the equivalence (scaling transformation) lemmas used in the main paper; the proofs are presented in Appendices A.5, A.6, and A.7 for two-layer, deep linear, and deep ReLU networks, respectively.

Lemma 1.1. The following problems are equivalent:
$$\min_{\{\theta_l\}_{l=1}^L} \mathcal{L}(f_{\theta,L}(X), y) + \beta \sum_{l=1}^L \|W_l\|_F^2 = \min_{\{\theta_l\}_{l=1}^L,\, t} \mathcal{L}(f_{\theta,L}(X), y) + 2\beta \|w_L\|_1 + \beta(L-2)t^2 \ \ \text{s.t. } w_{L-1,j} \in \mathcal{B}_2,\ \|W_l\|_F \le t,\ \forall l \in [L-2],$$
where $w_{L-1,j}$ denotes the $j$th column of $W_{L-1}$.

Proof of Lemma 1.1. For any $\theta \in \Theta$, we can rescale the parameters as $\bar{w}_{L-1,j} = \alpha_j w_{L-1,j}$ and $\bar{w}_{L,j} = w_{L,j}/\alpha_j$ for any $\alpha_j > 0$. Then, the network output becomes
$$f_{\bar{\theta},L}(X) = \left( \left( X W_1 \right)_+ \cdots \bar{W}_{L-1} \right)_+ \bar{w}_L = \left( \left( X W_1 \right)_+ \cdots W_{L-1} \right)_+ w_L,$$
which proves $f_{\theta,L}(X) = f_{\bar{\theta},L}(X)$. In addition, we have the following basic inequality
$$\sum_{l=1}^L \|W_l\|_F^2 \ge \sum_{l=1}^{L-2} \|W_l\|_F^2 + 2 \sum_{j=1}^m |w_{L,j}| \|w_{L-1,j}\|_2,$$
where equality is achieved with the scaling choice $\alpha_j = \left( |w_{L,j}| / \|w_{L-1,j}\|_2 \right)^{1/2}$. Since the scaling operation does not change the right-hand side of the inequality, we can set $\|w_{L-1,j}\|_2 = 1$, $\forall j$, so that the right-hand side becomes $2\|w_L\|_1$ plus the first $L-2$ terms. Now, consider a modified version of the problem where the unit norm equality constraint is relaxed to $\|w_{L-1,j}\|_2 \le 1$, and suppose that for a certain index $j$ an optimal solution has $\|w_{L-1,j}\|_2 < 1$ with $w_{L,j} \ne 0$. This means the unit norm inequality constraint is not active for $w_{L-1,j}$, hence removing the constraint for $w_{L-1,j}$ does not change the optimal solution. However, once the constraint is removed, letting $\|w_{L-1,j}\|_2 \to \infty$ reduces the objective value, since it drives $w_{L,j}$ to zero. This is a contradiction, which proves that all the constraints corresponding to a nonzero $w_{L,j}$ must be active at an optimal solution. It also shows that replacing $\|w_{L-1,j}\|_2 = 1$ with $\|w_{L-1,j}\|_2 \le 1$ does not change the solution of the problem. Finally, we use the epigraph form for the norms of the first $L-2$ layers to achieve the equivalence.
Lemma A.1. The following problems are equivalent:
$$\min_{\theta \in \Theta} \|W_1\|_F^2 + \|w_2\|_2^2 \ \text{s.t. } f_{\theta,2}(X) = y \;=\; \min_{\theta \in \Theta} \|w_2\|_1 \ \text{s.t. } f_{\theta,2}(X) = y,\ w_{1,j} \in \mathcal{B}_2,\ \forall j.$$

Lemma A.2. The following problems are equivalent:
$$\min_{\theta \in \Theta} \|W_1\|_F^2 + \|W_2\|_F^2 \ \text{s.t. } f_{\theta,2}(X) = Y \;=\; \min_{\theta \in \Theta} \sum_{j=1}^m \|w_{2,j}\|_2 \ \text{s.t. } f_{\theta,2}(X) = Y,\ w_{1,j} \in \mathcal{B}_2,\ \forall j.$$

Lemma A.3. The following problems are equivalent:
$$\min_{\{\theta_l\}_{l=1}^L} \sum_{l=1}^L \|W_l\|_F^2 \ \text{s.t. } f_{\theta,L}(X) = y \;=\; \min_{\{\theta_l\}_{l=1}^L,\, \{t_l\}_{l=1}^{L-2}} \|w_L\|_1 + \sum_{l=1}^{L-2} t_l^2 \ \text{s.t. } f_{\theta,L}(X) = y,\ w_{L-1,j} \in \mathcal{B}_2,\ \|W_l\|_F \le t_l,\ \forall l \in [L-2].$$

Lemma A.4. The following problems are equivalent:
$$\min_{\{\theta_l\}_{l=1}^L} \sum_{l=1}^L \|W_l\|_F^2 \ \text{s.t. } f_{\theta,L}(X) = Y \;=\; \min_{\{\theta_l\}_{l=1}^L,\, \{t_l\}_{l=1}^{L-2}} \sum_{j=1}^{m_{L-1}} \|w_{L,j}\|_2 + \sum_{l=1}^{L-2} t_l^2 \ \text{s.t. } f_{\theta,L}(X) = Y,\ w_{L-1,j} \in \mathcal{B}_2,\ \|W_l\|_F \le t_l,\ \forall l \in [L-2].$$

A.4 REGULARIZED EXTENSIONS

In this section, we present the regularized versions of the training problems presented in the main paper; the proofs are presented in Appendix A. For the deep linear network training problem, the regularized dual can be written as
$$\max_{\lambda} -\frac{1}{2} \|\lambda - y\|_2^2 \quad \text{s.t. } \|(XW_1 \cdots W_{L-2})^T \lambda\|_2 \le \beta,\ \forall \theta_l \in \Theta_{L-1},\ \forall l.$$
Then, the weight matrices that maximize the value of the constraint can be described as
$$W_l^* = \begin{cases} t^* \frac{X^T P_{X,\beta}(y)}{\|X^T P_{X,\beta}(y)\|_2} \rho_1^T & \text{if } l = 1 \\ t^* \rho_{l-1} \rho_l^T & \text{if } 1 < l \le L-2 \\ \rho_{L-2} & \text{if } l = L-1 \end{cases},$$
where $P_{X,\beta}(\cdot)$ projects its input onto $\{u \in \mathbb{R}^n \mid \|X^T u\|_2 \le \beta \gamma^{-1}\}$.

Corollary A. For the vector output case, the regularized dual is
$$\max_{\Lambda} -\frac{1}{2} \|\Lambda - Y\|_F^2 \quad \text{s.t. } \sigma_{\max}(\Lambda^T X W_1 \cdots W_{L-2}) \le \beta,\ \forall \theta_l \in \Theta_{L-1},$$
where we define $\Theta_{L-1} = \{\theta_1, \ldots, \theta_{L-1} \mid \|w_{L-1,j}\|_2 \le 1,\ \forall j \in [m_{L-1}],\ \|W_l\|_F \le t^*,\ \forall l \in [L-2]\}$. Then, as in equation 32, a set of optimal layer weights is
$$W_l^* = \begin{cases} t^* \sum_{j=1}^K \tilde{v}_{x,j} \rho_{1,j}^T & \text{if } l = 1 \\ t^* \sum_{j=1}^K \rho_{l-1,j} \rho_{l,j}^T & \text{if } 1 < l \le L-2 \\ \sum_{j=1}^K \rho_{L-2,j} & \text{if } l = L-1 \end{cases} \qquad (19)$$
where $\tilde{v}_{x,j}$ is a maximal right singular vector of $P_{X,\beta}(Y)^T X$ and $P_{X,\beta}(\cdot)$ projects its input onto the set $\{U \in \mathbb{R}^{n \times K} \mid \sigma_{\max}(U^T X) \le \beta \gamma^{-1}\}$. Additionally, the $\rho_{l,j}$'s form an orthonormal set. Therefore, the rank of each hidden layer is determined by $\beta$, as in Remark 2.1.

A.5 PROOFS FOR TWO-LAYER NETWORKS

$$\min_{\theta \in \Theta} \|W_1\|_F^2 + \|w_2\|_2^2 \;\text{ s.t. } f_{\theta,2}(X) = y \;=\; \min_{\theta \in \Theta} \|w_2\|_1 \;\text{ s.t. } f_{\theta,2}(X) = y,\; w_{1,j} \in \mathcal{B}_2.$$

Proof of Lemma A.1. For any $\theta \in \Theta$, we can rescale the parameters as $\bar{w}_{1,j} = \alpha_j w_{1,j}$ and $\bar{w}_{2,j} = w_{2,j}/\alpha_j$ for any $\alpha_j > 0$. Then the network output becomes
$$f_{\bar{\theta},2}(X) = \sum_{j=1}^{m} \bar{w}_{2,j} X \bar{w}_{1,j} = \sum_{j=1}^{m} \frac{w_{2,j}}{\alpha_j} \alpha_j X w_{1,j} = \sum_{j=1}^{m} w_{2,j} X w_{1,j},$$
which proves $f_{\theta,2}(X) = f_{\bar{\theta},2}(X)$. In addition, we have the following basic inequality:
$$\frac{1}{2} \sum_{j=1}^{m} \left( w_{2,j}^2 + \|w_{1,j}\|_2^2 \right) \;\ge\; \sum_{j=1}^{m} |w_{2,j}| \, \|w_{1,j}\|_2,$$
where equality is achieved with the scaling choice $\alpha_j = \left( |w_{2,j}| / \|w_{1,j}\|_2 \right)^{1/2}$. Since the scaling operation does not change the right-hand side of the inequality, we can set $\|w_{1,j}\|_2 = 1, \forall j$, so that the right-hand side becomes $\|w_2\|_1$. Now, consider a modified version of the problem where the unit-norm equality constraint is relaxed to $\|w_{1,j}\|_2 \le 1$, and assume that for a certain index $j$ an optimal solution has $\|w_{1,j}\|_2 < 1$ with $w_{2,j} \neq 0$. Then the unit-norm inequality constraint is not active for $w_{1,j}$, so removing it would not change the optimal solution. However, once the constraint is removed, letting $\|w_{1,j}\|_2 \to \infty$ reduces the objective value since it drives $w_{2,j} \to 0$. This is a contradiction, which proves that every constraint corresponding to a nonzero $w_{2,j}$ must be active at an optimal solution, and hence replacing $\|w_{1,j}\|_2 = 1$ with $\|w_{1,j}\|_2 \le 1$ does not change the solution to the problem. □

Theorem 2.1. The dual of the problem in equation 4 is given by
$$P^* \ge D^* = \max_{\lambda \in \mathbb{R}^n} \lambda^T y \;\text{ s.t. } \max_{w_1 \in \mathcal{B}_2} |\lambda^T X w_1| \le 1. \tag{5}$$
For finite width networks, there exists a finite $m$ such that strong duality holds, i.e., $P^* = D^*$, and an optimal $W_1$ for equation 4 satisfies $\|(X W_1^*)^T \lambda^*\|_\infty = 1$, where $\lambda^*$ is the optimal dual parameter.

Corollary 2.1. Theorem 2.1 implies that the optimal neurons are extreme points which solve the following problem: $\arg\max_{w_1 \in \mathcal{B}_2} |\lambda^{*T} X w_1|$.
Proof of Theorem 2.1 and Corollary 2.1. We first note that the dual of equation 4 with respect to $w_2$ is
$$\min_{\theta \in \Theta \setminus \{w_2\}} \max_{\lambda} \lambda^T y \;\text{ s.t. } \|(X W_1)^T \lambda\|_\infty \le 1,\; \|w_{1,j}\|_2 \le 1, \forall j.$$
Then, we can reformulate the problem as
$$P^* = \min_{\theta \in \Theta \setminus \{w_2\}} \max_{\lambda} \lambda^T y + \mathcal{I}\left( \|(X W_1)^T \lambda\|_\infty \le 1 \right) \;\text{ s.t. } \|w_{1,j}\|_2 \le 1, \forall j,$$
where $\mathcal{I}(\|(X W_1)^T \lambda\|_\infty \le 1)$ is the characteristic function of the set $\{\lambda : \|(X W_1)^T \lambda\|_\infty \le 1\}$, defined as
$$\mathcal{I}\left( \|(X W_1)^T \lambda\|_\infty \le 1 \right) = \begin{cases} 0 & \text{if } \|(X W_1)^T \lambda\|_\infty \le 1 \\ -\infty & \text{otherwise.} \end{cases}$$
Since the set $\{\lambda : \|(X W_1)^T \lambda\|_\infty \le 1\}$ is closed, the function $\Phi(\lambda, W_1) = \lambda^T y + \mathcal{I}(\|(X W_1)^T \lambda\|_\infty \le 1)$ is the sum of a linear function and an upper-semicontinuous indicator function, and is therefore upper-semicontinuous. The constraint set on $W_1$ is convex and compact. We use $P^*$ to denote the value of the above min-max program. Exchanging the order of min-max, we obtain the dual problem given in equation 5, which establishes a lower bound $D^*$ for the above problem:
$$P^* \ge D^* = \max_{\lambda} \min_{\theta \in \Theta \setminus \{w_2\}} \lambda^T y + \mathcal{I}(\|(X W_1)^T \lambda\|_\infty \le 1) \;\text{ s.t. } \|w_{1,j}\|_2 \le 1, \forall j \;=\; \max_{\lambda} \lambda^T y \;\text{ s.t. } |(X w_1)^T \lambda| \le 1,\ \forall w_1 : \|w_1\|_2 \le 1.$$
We now show that strong duality holds for infinite size NNs. The dual of the semi-infinite program in equation 5 is given by (see Section 2.2 of Goberna & López-Cerdá (1998) and also Bach (2017))
$$\min_{\mu} \|\mu\|_{TV} \;\text{ s.t. } \int_{w_1 \in \mathcal{B}_2} X w_1 \, d\mu(w_1) = y,$$
where $\|\mu\|_{TV}$ is the total variation norm of the Radon measure $\mu$. This expression coincides with the infinite-size NN as given in Bach (2017), and therefore strong duality holds. We also note that although the above formulation involves an infinite dimensional integral form, by Caratheodory's theorem, the integral can be represented as a finite summation of at most $n+1$ Dirac delta functions (Rosset et al., 2007). Next, we invoke the semi-infinite optimality conditions for the dual problem in equation 5; in particular, we apply Theorem 7.2 of Goberna & López-Cerdá (1998).
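For the linear two-layer case, equation 4 collapses to a minimum-Euclidean-norm interpolation problem, and the equality $P^* = D^*$ asserted above can be checked numerically. This is a sketch under made-up dimensions and a random seed; feasibility requires $n \le d$ here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 3, 6                          # n <= d so that X v = y is feasible
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# With linear activations and unit-norm first-layer columns, any output
# X W1 w2 equals X v with ||w2||_1 >= ||v||_2 (triangle inequality), and
# equality is attainable with a single neuron, so the primal value is
G = X @ X.T                          # invertible almost surely here
P = np.linalg.norm(X.T @ np.linalg.solve(G, y))   # min ||v||_2 s.t. X v = y

# the dual in equation 5: max lambda^T y  s.t.  ||X^T lambda||_2 <= 1
lam = np.linalg.solve(G, y)
lam /= np.linalg.norm(X.T @ lam)     # scale the candidate onto the boundary
D = lam @ y

assert np.isclose(P, D)                              # strong duality P* = D*
assert np.isclose(np.linalg.norm(X.T @ lam), 1.0)    # the constraint is active
print("P* = D* =", P)
```

Note that the dual constraint is active at the optimum, matching the condition $\|(XW_1^*)^T\lambda^*\|_\infty = 1$ in Theorem 2.1.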
We first define the set
$$\mathcal{K} = \mathrm{cone}\left\{ \begin{bmatrix} s X w_1 \\ 1 \end{bmatrix},\; w_1 \in \mathcal{B}_2,\; s \in \{-1,+1\};\; \begin{bmatrix} 0_n \\ -1 \end{bmatrix} \right\}.$$
Note that $\mathcal{K}$ is the union of finitely many convex closed sets, since the set $\{X w_1 : w_1 \in \mathcal{B}_2\}$ can be expressed as the union of finitely many convex closed sets; therefore, the set $\mathcal{K}$ is closed. By Theorem 5.3 of Goberna & López-Cerdá (1998), this implies that the set of constraints in equation 5 forms a Farkas-Minkowski system. By Theorem 8.4 of Goberna & López-Cerdá (1998), the primal and dual values are equal, given that the system is consistent. Moreover, the system is discretizable, i.e., there exists a sequence of problems with finitely many constraints whose optimal values approach the optimal value of equation 5. The optimality conditions in Theorem 7.2 of Goberna & López-Cerdá (1998) imply that $y = X W_1^* w_2^*$ for some vector $w_2^*$. Since the primal and dual values are equal, we have $\lambda^{*T} y = \lambda^{*T} X W_1^* w_2^* = \|w_2^*\|_1$, which shows that the primal-dual pair $(\{w_2^*, W_1^*\}, \lambda^*)$ is optimal. Thus, the optimal neuron weights $W_1^*$ satisfy $\|(X W_1^*)^T \lambda^*\|_\infty = 1$. □

Proposition 2.1 [Du & Hu (2019)]. Given $w^* = \arg\min_{w} \|X w - y\|_2$, we have
$$\arg\min_{W_1, w_2} \|X W_1 w_2 - X w^*\|_2^2 = \arg\min_{W_1, w_2} \|X W_1 w_2 - y\|_2^2.$$

Proof of Proposition 2.1. Let us first define the variable $w^*$ that minimizes $\|X w - y\|_2^2$; thus, the normal equations $X^T (X w^* - y) = 0_d$ hold. Then, for any $w \in \mathbb{R}^d$, we have
$$f(w) = \|X w - X w^* + X w^* - y\|_2^2 = \|X w - X w^*\|_2^2 + 2 (w - w^*)^T \underbrace{X^T (X w^* - y)}_{=0_d} + \|X w^* - y\|_2^2 = \|X w - X w^*\|_2^2 + \|X w^* - y\|_2^2.$$
Notice that $\|X w^* - y\|_2^2$ does not depend on $w$; thus, the relation above proves that minimizing $f(w)$ is equivalent to minimizing $\|X w - X w^*\|_2^2$, where $w^*$ is the planted model parameter. Therefore, the planted model assumption does not change the solution to the linear network training problem in equation 4. □

Theorem 2.2. Let $\{X, y\}$ be feasible for equation 4; then strong duality holds for finite width networks.

Proof of Theorem 2.2.
Since there exists a single extreme point, we can construct a weight vector $w_e \in \mathbb{R}^d$ that is the extreme point. Then, the dual of equation 4 with $W_1 = w_e$ is
$$D_e^* = \max_{\lambda} \lambda^T y \;\text{ s.t. } \|(X w_e)^T \lambda\|_\infty \le 1. \tag{21}$$
Then, we have
$$P^* = \min_{\theta \in \Theta \setminus \{w_2\}} \max_{\lambda} \lambda^T y \;\text{ s.t. } \|(X W_1)^T \lambda\|_\infty \le 1,\; \|w_{1,j}\|_2 \le 1, \forall j \;\ge\; \max_{\lambda} \min_{\theta \in \Theta \setminus \{w_2\}} \lambda^T y \;\text{ s.t. } \|(X W_1)^T \lambda\|_\infty \le 1,\; \|w_{1,j}\|_2 \le 1, \forall j \;=\; \max_{\lambda} \lambda^T y \;\text{ s.t. } \|(X w_e)^T \lambda\|_\infty \le 1 \;=\; D_e^* = D^*,$$
where the first inequality follows from changing the order of min-max to obtain a lower bound and the equality in the second line follows from Corollary 2.1. From the fact that an infinite width NN can always find a solution with an objective value lower than or equal to that of a finite width NN, we have
$$P_e^* = \min_{\theta \in \Theta \setminus \{W_1, m\}} |w_2| \;\text{ s.t. } X w_e w_2 = y \;\ge\; P^* = \min_{\theta \in \Theta} \|w_2\|_1 \;\text{ s.t. } X W_1 w_2 = y,\; \|w_{1,j}\|_2 \le 1, \forall j, \tag{22}$$
where $P^*$ is the optimal value of the original problem with infinitely many neurons. Now, notice that the optimization problem on the left-hand side of equation 22 is convex since it is an $\ell_1$-norm minimization problem with linear equality constraints. Therefore, strong duality holds for this problem, i.e., $P_e^* = D_e^*$. Using this result along with equation 21, we prove that strong duality holds for a finite width NN, i.e., $P_e^* = P^* = D^* = D_e^*$. □

Theorem 2.3. Strong duality holds for equation 9 with finite width networks.

Proof of Theorem 2.3. Since there exists a single extreme point, we can construct a weight vector $w_e \in \mathbb{R}^d$ that is the extreme point. Then, the dual of equation 9 with $W_1 = w_e$ is
$$D_e^* = \max_{\lambda} -\frac{1}{2}\|\lambda - y\|_2^2 + \frac{1}{2}\|y\|_2^2 \;\text{ s.t. } |\lambda^T X w_e| \le \beta.$$
The rest of the proof directly follows the proof of Theorem 2.2. □

Theorem A.1. The dual of equation 18 is given by
$$\max_{\lambda} -\mathcal{L}^*(\lambda) \;\text{ s.t. } \|X^T \lambda\|_2 \le \beta,$$
where $\mathcal{L}^*$ is the Fenchel conjugate function defined as $\mathcal{L}^*(\lambda) = \max_{z} z^T \lambda - \mathcal{L}(z, y)$.

Proof of Theorem A.1. The proof follows from classical Fenchel duality (Boyd & Vandenberghe, 2004).
We first describe equation 18 in an equivalent form as follows:
$$\min_{z,\, \theta \in \Theta} \mathcal{L}(z, y) + \beta \|w_2\|_1 \;\text{ s.t. } z = X W_1 w_2,\; \|w_{1,j}\|_2 \le 1, \forall j.$$
Then the dual function is
$$g(\lambda) = \min_{z,\, \theta \in \Theta} \mathcal{L}(z, y) - \lambda^T z + \lambda^T X W_1 w_2 + \beta \|w_2\|_1 \;\text{ s.t. } \|w_{1,j}\|_2 \le 1, \forall j.$$
Therefore, using classical Fenchel duality (Boyd & Vandenberghe, 2004) yields the claimed dual form. □

Lemma A.2. The following problems are equivalent:
$$\min_{\theta \in \Theta} \|W_1\|_F^2 + \|W_2\|_F^2 \;\text{ s.t. } f_{\theta,2}(X) = Y \;=\; \min_{\theta \in \Theta} \sum_{j=1}^{m} \|w_{2,j}\|_2 \;\text{ s.t. } f_{\theta,2}(X) = Y,\; w_{1,j} \in \mathcal{B}_2,\ \forall j.$$

Proof of Lemma A.2. The proof directly follows from the proof of Lemma A.1. □

Theorem 2.4. Let $\{X, Y\}$ be feasible for equation 11; then strong duality holds for finite width networks.

Proof of Theorem 2.4. Since there exist $r_w$ possible extreme points, we can construct a weight matrix $W_e \in \mathbb{R}^{d \times r_w}$ that consists of all the possible extreme points. Then, the dual of equation 11 with $W_1 = W_e$ is
$$D_e^* = \max_{\Lambda} \mathrm{trace}(\Lambda^T Y) \;\text{ s.t. } \|\Lambda^T X w_{e,j}\|_2 \le 1,\ \forall j \in [r_w].$$
The rest of the proof directly follows the proof of Theorem 2.2. □
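Proposition 2.1's planted-model reduction above rests on two elementary facts: the least-squares residual is orthogonal to the column space of $X$, and the objective therefore decomposes into two terms, one of which is constant in $w$. Both can be verified numerically (a sketch with made-up dimensions and seed):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# w* = argmin_w ||Xw - y||_2 (least squares), so X^T (X w* - y) = 0
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(X.T @ (X @ w_star - y), 0.0, atol=1e-8)

# decomposition f(w) = ||Xw - Xw*||^2 + ||Xw* - y||^2 for an arbitrary w,
# so replacing y by the planted target Xw* does not change the minimizers
w = rng.standard_normal(d)
lhs = np.linalg.norm(X @ w - y) ** 2
rhs = np.linalg.norm(X @ (w - w_star)) ** 2 + np.linalg.norm(X @ w_star - y) ** 2
assert np.isclose(lhs, rhs)
print("fitting y and fitting the planted target Xw* are equivalent")
```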

A.6 PROOFS FOR THE DEEP LINEAR NETWORKS

Lemma A.3. The following problems are equivalent:
$$\min_{\{\theta_l\}_{l=1}^{L}} \sum_{l=1}^{L} \|W_l\|_F^2 \;\text{ s.t. } f_{\theta,L}(X) = y \;=\; \min_{\{\theta_l\}_{l=1}^{L},\{t_l\}_{l=1}^{L-2}} \|w_L\|_1 + \sum_{l=1}^{L-2} t_l^2 \;\text{ s.t. } f_{\theta,L}(X) = y,\; w_{L-1,j} \in \mathcal{B}_2,\; \|W_l\|_F \le t_l,\ \forall l \in [L-2].$$

Proof of Lemma A.3. Applying the scaling trick in Lemma A.1 to the last two layers of the $L$-layer network in equation 14 gives
$$\min_{\{\theta_l\}_{l=1}^{L}} \|w_L\|_1 + \sum_{l=1}^{L-2} \|W_l\|_F^2 \;\text{ s.t. } \|w_{L-1,j}\|_2 \le 1, \forall j \in [m_{L-1}],\; X W_1 \cdots W_{L-1} w_L = y.$$
Then, we use the epigraph form for the norms of the first $L-2$ layers to achieve the equivalence. □

Proposition 3.1. The first $L-2$ hidden layer weight matrices in equation 14 have the same operator and Frobenius norms, i.e., $t_1 = t_2 = \ldots = t_{L-2}$, where $t_l = \|W_l\|_F = \|W_l\|_2, \forall l \in [L-2]$.

Proof of Proposition 3.1. Let us first denote the sum of the norms of the first $L-2$ layers as $t$, i.e., $t = \sum_{l=1}^{L-2} t_l$, where $t_l = \|W_l\|_2 = \|W_l\|_F$ since the upper bound is achieved when the matrices are rank-one (see equation 28). Then, to find the extreme points, we need to solve the following problem:
$$\max_{\{\theta_l\}_{l=1}^{L-2}} \|W_{L-2}\|_2 \cdots \|W_1\|_2 \, \|V_x w_r^*\|_2.$$
We can equivalently rewrite this problem using the variables $\{t_l\}_{l=1}^{L-2}$ as follows:
$$\max_{\{t_l\}_{l=1}^{L-2}} \prod_{l=1}^{L-2} t_l \;\text{ s.t. } \sum_{l=1}^{L-2} t_l = t,\; t_l \ge 0 \;=\; \max_{\{t_l\}_{l=1}^{L-3}} \left( t - \sum_{l=1}^{L-3} t_l \right) \prod_{j=1}^{L-3} t_j \;\text{ s.t. } \sum_{l=1}^{L-3} t_l \le t,\; t_l \ge 0.$$
If we take the derivative of the objective function of the latter problem, denoted $f(t_1, \ldots, t_{L-3})$, with respect to $t_k$, we obtain
$$\frac{\partial f(t_1, \ldots, t_{L-3})}{\partial t_k} = \left( t - \sum_{l=1}^{L-3} t_l - t_k \right) \prod_{\substack{j=1 \\ j \neq k}}^{L-3} t_j.$$
Setting the derivative to zero (with $t_j > 0$) yields the relation $t_k^* = t - \sum_{l=1}^{L-3} t_l^*$, where $t_k^*$ denotes the optimal operator norm of the $k^{th}$ layer's weight matrix; we also note that these solutions satisfy the constraints in the optimization problem above. Since by definition $t - \sum_{l=1}^{L-3} t_l^* = t_{L-2}^*$, we have $t_1^* = t_2^* = \ldots = t_{L-2}^*$. □

Theorem 3.1.
The optimal layer weights for equation 14 satisfy the following relation:
$$W_l^* = \begin{cases} t^* \frac{V_x w_r^*}{\|w_r^*\|_2} \rho_1^T & \text{if } l = 1 \\ t^* \rho_{l-1}\rho_l^T & \text{if } 1 < l \le L-2 \\ \rho_{L-2} & \text{if } l = L-1, \end{cases}$$
where $\|\rho_l\|_2 = 1, \forall l \in [L-2]$, and $w_r^*$ follows the definition in equation 8.

Proof of Theorem 3.1. Using Lemma A.3 and Proposition 3.1, we have the following dual problem for equation 14:
$$P^* = \min_{\{\theta_l\}_{l=1}^{L-1},\, t} \max_{\lambda} \lambda^T y + (L-2)t^2 \;\text{ s.t. } |(X W_1 \cdots w_{L-1,j})^T \lambda| \le 1,\; w_{L-1,j} \in \mathcal{B}_2,\; \|W_l\|_F \le t, \forall l \in [L-2]. \tag{23}$$
Now, let us assume that the optimal Frobenius norm of each layer $l$ is $t^*$. Then, if we define $\Theta_{L-1} = \{\theta_1, \ldots, \theta_{L-1} \mid \|w_{L-1,j}\|_2 \le 1, \forall j \in [m_{L-1}],\ \|W_l\|_F \le t^*, \forall l \in [L-2]\}$, equation 23 reduces to the following problem:
$$P^* \ge D^* = \max_{\lambda} \lambda^T y \;\text{ s.t. } |(X W_1 \cdots w_{L-1})^T \lambda| \le 1,\ \forall \theta_l \in \Theta_{L-1}, \forall l, \tag{24}$$
where we change the order of min-max to obtain a lower bound for equation 23. The dual of the semi-infinite problem in equation 24 is given by
$$\min_{\mu} \|\mu\|_{TV} \;\text{ s.t. } \int_{\{\theta_l\}_{l=1}^{L-1} \in \Theta_{L-1}} X W_1 \cdots w_{L-1} \, d\mu(\theta_1, \ldots, \theta_{L-1}) = y, \tag{25}$$
where $\mu$ is a signed Radon measure and $\|\cdot\|_{TV}$ is the total variation norm. We emphasize that equation 25 has infinite width in each layer; however, an application of Caratheodory's theorem shows that the measure $\mu$ in the integral can be represented by finitely many (at most $n+1$) Dirac delta functions (Rosset et al., 2007). Such a selection of $\mu$ yields the following problem:
$$P_m^* = \min_{\{\theta_l\}_{l=1}^{L}} \|w_L\|_1 \;\text{ s.t. } \sum_{j=1}^{m_{L-1}} X W_1^j \cdots w_{L-1}^j w_{L,j} = y,\; \theta_l^j \in \Theta_{L-1}, \forall l. \tag{26}$$
We first note that since the model in equation 26 has multiple weight matrices for each layer, it has more expressive power than a regular network; thus, we have $P^* \ge P_m^*$. Since the duals of equation 14 and equation 26 are the same, we also have $D_m^* = D^*$, where $D_m^*$ is the optimal dual value for equation 26. We now apply the variable change in equation 7 to equation 24 as follows:
$$\max_{\lambda} \lambda^T \Sigma_x w_r^* \;\text{ s.t. } \|W_{L-2}^T \cdots W_1^T V_x \Sigma_x^T \lambda\|_2 \le 1,\ \forall \theta_l \in \Theta_{L-1}, \forall l, \tag{27}$$
which shows that the maximum objective value is achieved when $\Sigma_x^T \lambda = c_1 w_r^*$. Thus, the optimal layer weights can be found as the maximizers of the constraint when $\Sigma_x^T \lambda = c_1 w_r^*$. To find the formulations explicitly, we first upper-bound the constraint in equation 27 as follows:
$$\|W_{L-2}^T \cdots W_1^T V_x \Sigma_x^T \lambda\|_2 = c_1 \|W_{L-2}^T \cdots W_1^T V_x w_r^*\|_2 \le c_1 \|W_{L-2}\|_2 \cdots \|W_1\|_2 \|V_x w_r^*\|_2 \le c_1 \gamma \|V_x w_r^*\|_2,$$
where the last inequality follows from the constraint on each layer weight's norm and $\gamma = t^{*L-2}$. This upper bound is achieved when the layer weights are
$$W_l^* = \begin{cases} t^* \frac{V_x w_r^*}{\|w_r^*\|_2} \rho_1^T & \text{if } l = 1 \\ t^* \rho_{l-1}\rho_l^T & \text{if } 1 < l \le L-2 \\ \rho_{L-2} & \text{if } l = L-1, \end{cases} \tag{28}$$
where $\|\rho_l\|_2 = 1, \forall l \in [L-2]$. This shows that the weight matrices are rank-one and align with each other. Therefore, an arbitrary set of unit-norm vectors $\{\rho_l\}_{l=1}^{L-2}$ can be chosen to achieve the maximum dual objective. We note that the layer weights in equation 28 are optimal for the relaxed problem in equation 26. However, since there exists a single possible choice for the left singular vector of $W_1$ and we can select an arbitrary set for $\{\rho_l\}_{l=1}^{L-2}$, we achieve $D_m^* = D^*$ using the same layer weights. Therefore, the set of weights in equation 28 is also optimal for equation 14. □

Theorem 3.3. The optimal layer weights for equation 15 can be formulated as follows:
$$W_l^* = \begin{cases} t^* \sum_{j=1}^{K} \tilde{v}_{w,j}\rho_{1,j}^T & \text{if } l = 1 \\ t^* \sum_{j=1}^{K} \rho_{l-1,j}\rho_{l,j}^T & \text{if } 1 < l \le L-2 \\ \sum_{j=1}^{K} \rho_{L-2,j} & \text{if } l = L-1, \end{cases}$$
where $\tilde{v}_{w,j}$ is the $j^{th}$ maximal right singular vector of $\Lambda^T X$ and we may pick a set of unit-norm vectors $\{\rho_{l,j}\}_{l=1}^{L-2}$ such that $\rho_{l,j}^T \rho_{l,k} = 0, \forall j \neq k$.

Proof of Theorem 3.3. Using Proposition 3.1 and Lemma A.4, we obtain the following dual problem:
$$\max_{\Lambda} \mathrm{trace}(\Lambda^T Y) \;\text{ s.t. } \sigma_{\max}(\Lambda^T X W_1 \cdots W_{L-2}) \le 1,\ \forall \theta_l \in \Theta_{L-1}. \tag{29}$$
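Two claims above lend themselves to a quick numerical sanity check: the equal-norm allocation of Proposition 3.1 (a fixed budget $\sum_l t_l = t$ makes the product $\prod_l t_l$ largest at the equal split, by AM-GM) and the rank-one aligned construction of Theorem 3.1 (the telescoping product attains the bound $t^{*L-2}$). This is a sketch; the dimensions, seed, and stand-in vector for $V_x w_r^*$ are made up.

```python
import numpy as np

rng = np.random.default_rng(3)

# (i) Proposition 3.1: fixed budget sum_l t_l = t, maximize prod_l t_l.
#     The equal split t_l = t/(L-2) is optimal (AM-GM); random feasible
#     allocations never beat it.
L2, t = 5, 10.0                      # L-2 factors, total budget t
best = (t / L2) ** L2
for _ in range(1000):
    ts = rng.uniform(0.0, 1.0, L2)
    ts *= t / ts.sum()               # feasible: nonnegative, sums to t
    assert np.prod(ts) <= best + 1e-9

# (ii) Theorem 3.1: aligned rank-one weights W_1 = t* v rho_1^T and
#      W_l = t* rho_{l-1} rho_l^T make the product telescope.
d, m, t_star = 6, 4, 1.7
v = rng.standard_normal(d); v /= np.linalg.norm(v)   # stand-in for V_x w*_r
rhos = [r / np.linalg.norm(r) for r in rng.standard_normal((L2, m))]
Ws = [t_star * np.outer(v, rhos[0])]
Ws += [t_star * np.outer(rhos[l - 1], rhos[l]) for l in range(1, L2)]

prod = v
for W in Ws:
    prod = W.T @ prod                # W_{L-2}^T ... W_1^T v
assert np.allclose(prod, t_star**L2 * rhos[-1])      # = t*^(L-2) rho_{L-2}
assert all(np.linalg.matrix_rank(W) == 1 for W in Ws)
print("equal norms maximize the gain; aligned rank-one weights attain t*^(L-2)")
```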
It is straightforward to show that the optimal layer weights are the extreme points of the constraint in equation 29, which achieves the following upper bound:
$$\max_{\{\theta_l\}_{l=1}^{L-2} \in \Theta_{L-1}} \sigma_{\max}(\Lambda^T X W_1 \cdots W_{L-2}) \le \sigma_{\max}(\Lambda^T X)\, \gamma.$$
This upper bound is achieved when the first $L-2$ layer weights are rank-one with singular value $t^*$ by Proposition 3.1. Additionally, the left singular vector of $W_1$ needs to align with one of the maximal right singular vectors of $\Lambda^T X$. Since the upper bound on the objective is achievable for any $\Lambda$, we can maximize the objective value, as in equation 13, by choosing a matrix $\Lambda$ such that
$$\Lambda^T U_x \Sigma_x = \gamma^{-1} V_w \begin{bmatrix} I_{r_w} & 0_{r_w \times (d-r_w)} \\ 0_{(K-r_w) \times r_w} & 0_{(K-r_w) \times (d-r_w)} \end{bmatrix} U_w^T,$$
where $W_r^* = U_w \Sigma_w V_w^T$. Thus, a set of optimal layer weights can be formulated as
$$W_l^j = \begin{cases} t^* \tilde{v}_{w,j}\rho_{1,j}^T & \text{if } l = 1 \\ t^* \rho_{l-1,j}\rho_{l,j}^T & \text{if } 1 < l \le L-2 \\ \rho_{L-2,j} & \text{if } l = L-1, \end{cases} \tag{30}$$
where $\tilde{v}_{w,j}$ is the $j^{th}$ maximal right singular vector of $\Lambda^T X$. However, notice that the layer weights in equation 30 are the optimal weights for the relaxed problem
$$\min_{\{\theta_l\}_{l=1}^{L}} \sum_{j=1}^{m_{L-1}} \|w_{L,j}\|_2 \;\text{ s.t. } \sum_{j=1}^{m_{L-1}} X W_1^j \cdots w_{L-1}^j w_{L,j}^T = Y,\ \forall \theta_l^j \in \Theta_{L-1}. \tag{31}$$
Using the optimal layer weights in equation 30, we have the following network output for the relaxed model:
$$\sum_{j=1}^{m_{L-1}} X W_1^j \cdots w_{L-1}^j w_{L,j}^T = \gamma \sum_{j=1}^{m_{L-1}} q_{w,j} w_{L,j}^T.$$
Since we know that the objective value of equation 31 is a lower bound for equation 15, the layer weights that achieve the output above for the original problem in equation 15 are optimal. Thus, a set of optimal solutions to equation 15 can be formulated as
$$W_l^* = \begin{cases} t^* \sum_{j=1}^{m_{L-1}} \tilde{v}_{w,j}\rho_{1,j}^T & \text{if } l = 1 \\ t^* \sum_{j=1}^{m_{L-1}} \rho_{l-1,j}\rho_{l,j}^T & \text{if } 1 < l \le L-2 \\ \sum_{j=1}^{m_{L-1}} \rho_{L-2,j} & \text{if } l = L-1, \end{cases}$$
where we select a set of unit-norm vectors $\{\rho_{l,j}\}_{l=1}^{L-2}$ such that $\rho_{l,j}^T \rho_{l,k} = 0, \forall j \neq k$. □

Theorem 3.2. Let $\{X, y\}$ be feasible for equation 14; then strong duality holds for finite width networks.

Proof of Theorem 3.2.
We first select a set of unit-norm vectors $\{\rho_l\}_{l=1}^{L-2}$ to construct weight matrices $\{W_l^e\}_{l=1}^{L-1}$ that satisfy equation 28. Then, the dual of equation 14 with these weights can be written as
$$D_e^* = \max_{\lambda} \lambda^T y \;\text{ s.t. } |(X W_1^e \cdots w_{L-1}^e)^T \lambda| \le 1.$$
Then, we have
$$P^* = \min_{\{\theta_l\}_{l=1}^{L-1} \in \Theta_{L-1}} \max_{\lambda} \lambda^T y \;\text{ s.t. } |(X W_1 \cdots w_{L-1})^T \lambda| \le 1 \;\ge\; \max_{\lambda} \lambda^T y \;\text{ s.t. } |(X W_1 \cdots w_{L-1})^T \lambda| \le 1, \forall \theta_l \in \Theta_{L-1} \;=\; \max_{\lambda} \lambda^T y \;\text{ s.t. } |(X W_1^e \cdots w_{L-1}^e)^T \lambda| \le 1 \;=\; D_e^* = D^* = D_m^*, \tag{33}$$
where the first inequality follows from changing the order of min-max to obtain a lower bound and the first equality follows from the fact that $\{W_l^e\}_{l=1}^{L-1}$ maximizes the dual problem. Furthermore, we have the following relation between the primal problems:
$$P_e^* = \min_{w_L} \|w_L\|_1 \;\text{ s.t. } X W_1^e \cdots W_{L-1}^e w_L = y \;\ge\; P^* = \min_{\{\theta_l\}_{l=1}^{L} \in \Theta_{L-1}} \|w_L\|_1 \;\text{ s.t. } X W_1 \cdots W_{L-1} w_L = y, \tag{34}$$
where the inequality follows from the fact that the original problem has infinite width in each layer. Now, notice that the optimization problem on the left-hand side of equation 34 is convex since it is an $\ell_1$-norm minimization problem with linear equality constraints. Therefore, strong duality holds for this problem, i.e., $P_e^* = D_e^*$, and we have $P_e^* \ge P^* \ge P_m^* \ge D_e^* = D^* = D_m^*$. Using this result along with equation 33, we prove that strong duality holds, i.e., $P_e^* = P^* = P_m^* = D_e^* = D^* = D_m^*$. □

Corollary 3.1. Theorem 3.1 implies that deep linear networks can obtain a scaled version of $y$ using only the first layer, i.e., $X W_1 \rho_1 = c y$, where $c > 0$. Therefore, the remaining layers do not contribute to the expressive power.

Proof of Corollary 3.1. The proof directly follows from equation 28. □

Corollary A.1. The analysis above and Theorem 3.2 also show that strong duality holds for the regularized deep linear network training problem.

Proof of Corollary A.1. The proof directly follows from the analysis in this section and Theorem 3.2. □

Lemma A.4. The following problems are equivalent:
$$\min_{\{\theta_l\}_{l=1}^{L}} \sum_{l=1}^{L} \|W_l\|_F^2 \;\text{ s.t. } f_{\theta,L}(X) = Y \;=\; \min_{\{\theta_l\}_{l=1}^{L},\{t_l\}_{l=1}^{L-2}} \sum_{j=1}^{m_{L-1}} \|w_{L,j}\|_2 + \sum_{l=1}^{L-2} t_l^2 \;\text{ s.t. } f_{\theta,L}(X) = Y,\; w_{L-1,j} \in \mathcal{B}_2,\; \|W_l\|_F \le t_l,\ \forall l \in [L-2].$$

Proof of Lemma A.4.
Applying the scaling trick in Lemma A.1 to the last two layers of the $L$-layer network in equation 15 gives
$$\min_{\{\theta_l\}_{l=1}^{L}} \sum_{j=1}^{m_{L-1}} \|w_{L,j}\|_2 + \sum_{l=1}^{L-2} \|W_l\|_F^2 \;\text{ s.t. } \|w_{L-1,j}\|_2 \le 1, \forall j \in [m_{L-1}],\; X W_1 \cdots W_{L-1} W_L = Y.$$
Then, we use the epigraph form for the norms of the first $L-2$ layers to achieve the equivalence. □

Theorem 3.4. Let $\{X, Y\}$ be feasible for equation 15; then strong duality holds for finite width networks.

Proof of Theorem 3.4. We first select a set of unit-norm vectors $\{\rho_{l,j}\}_{l=1}^{L-2}$ to construct weight matrices $\{W_l^{e,j}\}_{l=1}^{L-1}$ that satisfy equation 30. Then, the dual of equation 15 with these weights can be written as
$$D_e^* = \max_{\Lambda} \mathrm{trace}(\Lambda^T Y) \;\text{ s.t. } \sigma_{\max}(\Lambda^T X W_1^{e,j} \cdots W_{L-2}^{e,j}) \le 1,\ \forall j.$$
Then, we have
$$P^* = \min_{\{\theta_l\}_{l=1}^{L-1} \in \Theta_{L-1}} \max_{\Lambda} \mathrm{trace}(\Lambda^T Y) \;\text{ s.t. } \sigma_{\max}(\Lambda^T X W_1 \cdots W_{L-2}) \le 1 \;\ge\; \max_{\Lambda} \mathrm{trace}(\Lambda^T Y) \;\text{ s.t. } \sigma_{\max}(\Lambda^T X W_1 \cdots W_{L-2}) \le 1, \forall \theta_l \in \Theta_{L-1} \;=\; \max_{\Lambda} \mathrm{trace}(\Lambda^T Y) \;\text{ s.t. } \sigma_{\max}(\Lambda^T X W_1^{e,j} \cdots W_{L-2}^{e,j}) \le 1, \forall j \;=\; D_e^* = D^* = D_m^*, \tag{35}$$
where the first inequality follows from changing the order of min-max to obtain a lower bound and the first equality follows from the fact that $\{W_l^{e,j}\}_{l=1}^{L-1}$ maximizes the dual problem. Furthermore, we have the following relation between the primal problems:
$$P_e^* = \min_{W_L} \sum_{j=1}^{m_{L-1}} \|w_{L,j}\|_2 \;\text{ s.t. } \sum_{j=1}^{m_{L-1}} X W_1^{e,j} \cdots W_{L-1}^{e,j} w_{L,j}^T = Y \;\ge\; P^* = \min_{\{\theta_l\}_{l=1}^{L} \in \Theta_{L-1}} \sum_{j=1}^{m_{L-1}} \|w_{L,j}\|_2 \;\text{ s.t. } X W_1 \cdots W_{L-1} W_L = Y, \tag{36}$$
where the inequality follows from the fact that the original problem has infinite width in each layer. Now, notice that the optimization problem on the left-hand side of equation 36 is convex since it is an $\ell_2$-norm minimization problem with linear equality constraints. Therefore, strong duality holds for this problem, i.e., $P_e^* = D_e^*$, and we have $P_e^* \ge P^* \ge P_m^* \ge D_e^* = D^* = D_m^*$. Using this result along with equation 35, we prove that strong duality holds, i.e., $P_e^* = P^* = P_m^* = D_e^* = D^* = D_m^*$. □

A.7 PROOFS FOR THE DEEP NETWORKS

Theorem 4.1.
Let $X$ be a rank-one data matrix such that $X = c a_0^T$, where $c \in \mathbb{R}_+^n$ and $a_0 \in \mathbb{R}^d$. Then strong duality holds and the optimal weights for each layer can be formulated as
$$W_l = \frac{\phi_{l-1}}{\|\phi_{l-1}\|_2} \phi_l^T,\ \forall l \in [L-2], \qquad w_{L-1} = \frac{\phi_{L-2}}{\|\phi_{L-2}\|_2},$$
where $\phi_0 = a_0$ and $\{\phi_l\}_{l=1}^{L-2}$ is a set of vectors such that $\phi_l \in \mathbb{R}_+^{m_l}$ and $\|\phi_l\|_2 = t^*, \forall l \in [L-2]$.

Proposition 1. The first $L-2$ hidden layer weight matrices in equation 16 have the same operator and Frobenius norms.

Proof of Proposition 1. Let us first denote the sum of the norms of the first $L-2$ layers as $t$, i.e., $t = \sum_{l=1}^{L-2} t_l$, where $t_l = \|W_l\|_2 = \|W_l\|_F$ since the upper bound is achieved when the matrices are rank-one. Then, to find the extreme points (see the details in the proof of Theorem 4.1), we need to solve the following problem:
$$\arg\max_{\{\theta_l\}_{l=1}^{L-2}} |\lambda^{*T} c| \, \|a_{L-2}\|_2 = \arg\max_{\{\theta_l\}_{l=1}^{L-2} \in \Theta_{L-1}} |\lambda^{*T} c| \, \|(a_{L-3}^T W_{L-2})_+\|_2,$$
where we use $a_{L-2}^T = (a_{L-3}^T W_{L-2})_+$. Since $\|W_{L-2}\|_F = t_{L-2} = t - \sum_{l=1}^{L-3} t_l$, the objective value above becomes $|\lambda^{*T} c| \, \|a_{L-3}\|_2 \left( t - \sum_{l=1}^{L-3} t_l \right)$. Applying this step to all the remaining layer weights gives the following problem:
$$\arg\max_{\{t_l\}_{l=1}^{L-3}} |\lambda^{*T} c| \, \|a_0\|_2 \left( t - \sum_{l=1}^{L-3} t_l \right) \prod_{j=1}^{L-3} t_j \;\text{ s.t. } \sum_{l=1}^{L-3} t_l \le t,\; t_l \ge 0.$$
Then, the proof directly follows from the proof of Proposition 3.1. □

Proof of Theorem 4.1. Using Lemma A.3 and Proposition 1, this problem can be equivalently stated as
$$\min_{\{\theta_l\}_{l=1}^{L} \in \Theta_{L-1}} \|w_L\|_1 \;\text{ s.t. } A_l = (A_{l-1} W_l)_+, \forall l \in [L-1],\; A_{L-1} w_L = y,$$
which also has the following dual form:
$$P^* = \min_{\{\theta_l\}_{l=1}^{L-1} \in \Theta_{L-1}} \max_{\lambda} \lambda^T y \;\text{ s.t. } \|A_{L-1}^T \lambda\|_\infty \le 1. \tag{38}$$
Notice that we remove the recursive constraint in equation 38 for notational simplicity; however, $A_{L-1}$ is still a function of all the layer weights except $w_L$. Changing the order of min-max in equation 38 gives
$$P^* \ge D^* = \max_{\lambda} \lambda^T y \;\text{ s.t. } \|A_{L-1}^T \lambda\|_\infty \le 1,\ \forall \theta_l \in \Theta_{L-1}, \forall l \in [L-1]. \tag{39}$$
The dual of the semi-infinite problem in equation 39 is given by
$$\min_{\mu} \|\mu\|_{TV} \;\text{ s.t. } \int_{\{\theta_l\}_{l=1}^{L-1} \in \Theta_{L-1}} (A_{L-2} w_{L-1})_+ \, d\mu(\theta_1, \ldots, \theta_{L-1}) = y, \tag{40}$$
where $\mu$ is a signed Radon measure and $\|\cdot\|_{TV}$ is the total variation norm. We emphasize that equation 40 has infinite width in each layer; however, an application of Caratheodory's theorem shows that the measure $\mu$ in the integral can be represented by finitely many (at most $n+1$) Dirac delta functions (Rosset et al., 2007). Thus, we choose $\mu = \sum_{j=1}^{m_{L-1}} \delta(W_1 - W_1^j, \ldots, w_{L-1} - w_{L-1}^j)\, w_{L,j}$, where $\delta(\cdot)$ is the Dirac delta function and the superscript indicates a particular choice for the corresponding layer weight. This selection of $\mu$ yields the following problem:
$$P_m^* = \min_{\{\theta_l\}_{l=1}^{L}} \|w_L\|_1 \;\text{ s.t. } \sum_{j=1}^{m_{L-1}} (A_{L-2}^j w_{L-1}^j)_+ w_{L,j} = y,\; \theta_l^j \in \Theta_{L-1}, \forall l \in [L-1]. \tag{41}$$
Here, we first note that even though the model in equation 41 has the same layer widths as regular deep ReLU networks, it has more expressive power since it allows us to choose multiple weight matrices for each layer. Based on this observation, we have $P^* \ge P_m^*$. As a consequence of equation 39, we can characterize the optimal layer weights for equation 41 as the extreme points that solve
$$\arg\max_{\{\theta_l\}_{l=1}^{L-1} \in \Theta_{L-1}} |\lambda^{*T} (A_{L-2} w_{L-1})_+|, \tag{42}$$
where $\lambda^*$ is the optimal dual parameter. Since we assume that $X = c a_0^T$ with $c \in \mathbb{R}_+^n$, we have $A_{L-2} = c a_{L-2}^T$, where $a_l^T = (a_{l-1}^T W_l)_+$, $a_l \in \mathbb{R}_+^{m_l}$, $\forall l \in [L-1]$. Based on this observation, we have $w_{L-1} = a_{L-2}/\|a_{L-2}\|_2$, which reduces equation 42 to the following:
$$\arg\max_{\{\theta_l\}_{l=1}^{L-2} \in \Theta_{L-1}} |\lambda^{*T} c| \, \|a_{L-2}\|_2. \tag{43}$$
We then apply the same approach to all the remaining layer weights. However, notice that each neuron in the first $L-2$ layers must have bounded Frobenius norm due to the norm constraint.
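The key structural fact used above is that with rank-one data $X = ca_0^T$ and $c \ge 0$, the ReLU factors through the rank-one structure, so every activation matrix $A_l = c\,a_l^T$ stays rank-one. A quick numerical sketch (dimensions and seed made up):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, m = 7, 4, 5
c = rng.uniform(0.1, 1.0, n)            # c in R^n_+ (nonnegative, as assumed)
a0 = rng.standard_normal(d)
X = np.outer(c, a0)                     # rank-one data X = c a0^T
W1 = rng.standard_normal((d, m))

relu = lambda z: np.maximum(z, 0.0)

# Since c >= 0, relu(c_i * z) = c_i * relu(z), so the ReLU factors:
#   (X W1)_+ = c ((a0^T W1)_+)  -- each layer keeps A_l = c a_l^T rank-one
A1 = relu(X @ W1)
a1 = relu(a0 @ W1)
assert np.allclose(A1, np.outer(c, a1))
assert np.linalg.matrix_rank(A1) <= 1
print("ReLU preserves the rank-one structure when c >= 0")
```

The same identity applied layer by layer is exactly what collapses the extreme-point problem in equation 42 to the scalar problem in equation 43.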
If we denote the vector of optimal $\ell_2$ norms of the neurons in the $l^{th}$ layer as $\phi_l \in \mathbb{R}_+^{m_l}$, then the layer weights that solve equation 42 have the following form:
$$W_l = \frac{\phi_{l-1}}{\|\phi_{l-1}\|_2} \phi_l^T,\ \forall l \in [L-2], \qquad w_{L-1} = \frac{\phi_{L-2}}{\|\phi_{L-2}\|_2}, \tag{44}$$
where $\phi_0 = a_0$ and $\{\phi_l\}_{l=1}^{L-2}$ is a set of nonnegative vectors satisfying $\|\phi_l\|_2 = t^*, \forall l \in [L-2]$. We note that the layer weights in equation 44 are optimal for the relaxed problem in equation 41. However, since there exists a single possible choice for the left singular vector of $W_1$ and we can select an arbitrary set for $\{\phi_l\}_{l=1}^{L-2}$, the dual problems for equation 16 and equation 41 coincide, i.e., we achieve $D_m^* = D^*$ using the same layer weights, where $D_m^*$ is the optimal dual objective value for equation 41. Therefore, the set of weights in equation 44 is also optimal for equation 16. □

Theorem 4.2. Let $X$ be a data matrix such that $X = c a_0^T$, where $c \in \mathbb{R}^n$ and $a_0 \in \mathbb{R}^d$. Then, a set of optimal solutions to equation 16 is given by $\{(w_i, b_i)\}_{i=1}^{m}$, where $w_i = s_i \frac{a_0}{\|a_0\|_2}$ and $b_i = -s_i c_i \|a_0\|_2$ with $s_i = \pm 1, \forall i \in [m]$.

Proof of Theorem 4.2. Given $X = c a_0^T$, all possible extreme points can be characterized as
$$\arg\max_{b,\, w : \|w\|_2 = 1} \left| \lambda^T (X w + b \mathbf{1})_+ \right| = \arg\max_{b,\, w : \|w\|_2 = 1} \left| \lambda^T (c a_0^T w + b \mathbf{1})_+ \right| = \arg\max_{b,\, w : \|w\|_2 = 1} \left| \sum_{i=1}^{n} \lambda_i (c_i a_0^T w + b)_+ \right|,$$
which can be equivalently stated as
$$\arg\max_{b,\, w : \|w\|_2 = 1} \sum_{i \in S} \lambda_i c_i a_0^T w + \sum_{i \in S} \lambda_i b \;\text{ s.t. } c_i a_0^T w + b \ge 0, \forall i \in S,\; c_j a_0^T w + b \le 0, \forall j \in S^c.$$
This shows that $w$ must be either positively or negatively aligned with $a_0$, i.e., $w = s \frac{a_0}{\|a_0\|_2}$ with $s = \pm 1$, and $b$ must be in the range $\left[\max_{i \in S}(-s c_i \|a_0\|_2),\ \min_{j \in S^c}(-s c_j \|a_0\|_2)\right]$. Using these observations, the extreme points can be formulated as
$$w_\lambda = \begin{cases} \frac{a_0}{\|a_0\|_2} & \text{if } \sum_{i \in S} \lambda_i c_i \ge 0 \\ -\frac{a_0}{\|a_0\|_2} & \text{otherwise} \end{cases} \quad\text{and}\quad b_\lambda = \begin{cases} \min_{j \in S^c}(-s_\lambda c_j \|a_0\|_2) & \text{if } \sum_{i \in S} \lambda_i \ge 0 \\ \max_{i \in S}(-s_\lambda c_i \|a_0\|_2) & \text{otherwise,} \end{cases}$$
where $s_\lambda = \mathrm{sign}\left(\sum_{i \in S} \lambda_i c_i\right)$. □

Proposition 4.1.
Theorem 4.1 still holds when we add a bias term to the last hidden layer, i.e., when the output becomes $\left( A_{L-2} W_{L-1} + \mathbf{1}_n b^T \right)_+ w_L = y$, where $A_l = (A_{l-1} W_l)_+, \forall l \in [L-2]$.

Proof of Proposition 4.1. Here, we add biases to the neurons in the last hidden layer of equation 16. For this case, all the equations in equation 37-equation 39 hold up to notational changes due to the bias term. Thus, the constraints in equation 42 change as
$$c_i a_{L-2}^T w_{L-1} + b \ge 0, \forall i \in S, \qquad c_j a_{L-2}^T w_{L-1} + b \le 0, \forall j \in S^c, \tag{45}$$
where $S$ and $S^c$ are the index sets for which the ReLU is active and inactive, respectively. This shows that $w_{L-1}$ must satisfy $w_{L-1} = \pm \frac{a_{L-2}}{\|a_{L-2}\|_2}$ and $b \in \left[\max_{i \in S}(-c_i \|a_{L-2}\|_2),\ \min_{j \in S^c}(-c_j \|a_{L-2}\|_2)\right]$. Then, we obtain
$$w_{L-1}^* = \begin{cases} \frac{a_{L-2}}{\|a_{L-2}\|_2} & \text{if } \sum_{i \in S} \lambda_i^* c_i \ge 0 \\ -\frac{a_{L-2}}{\|a_{L-2}\|_2} & \text{otherwise} \end{cases} \quad\text{and}\quad b^* = \begin{cases} \min_{j \in S^c}(-s_{\lambda^*} c_j \|a_{L-2}\|_2) & \text{if } \sum_{i \in S} \lambda_i^* \ge 0 \\ \max_{i \in S}(-s_{\lambda^*} c_i \|a_{L-2}\|_2) & \text{otherwise,} \end{cases} \tag{46}$$
where $s_{\lambda^*} = \mathrm{sign}\left(\sum_{i \in S} \lambda_i^* c_i\right)$. This result reduces equation 45 to the problem
$$\arg\max_{\{\theta_l\}_{l=1}^{L-2} \in \Theta_{L-1}} |C(\lambda^*, c)| \, \|a_{L-2}\|_2,$$
where $C(\lambda^*, c)$ is a constant scalar independent of $\{W_l\}_{l=1}^{L-2}$. Hence, this problem and its solutions are the same as equation 43 and equation 44, respectively. □

Corollary 4.1. As a result of Theorem 4.2, when we have one-dimensional data, i.e., $x \in \mathbb{R}^n$, an optimal solution to equation 16 can be formulated as $\{(w_i, b_i)\}_{i=1}^{m}$, where $w_i = s_i$ and $b_i = -s_i x_i$ with $s_i = \pm 1, \forall i \in [m]$. Therefore, the optimal network output has kinks only at the input data points, i.e., the output function is of the form $f_{\theta,2}(x) = \sum_i w_{2,i} \left( s_i (x - x_i) \right)_+$, so the network output is a linear spline interpolation for one-dimensional datasets.

Corollary 4.2. As a result of Theorem 4.2 and Proposition 4.1, when we have one-dimensional data, i.e., $x \in \mathbb{R}^n$, the optimal network output has kinks only at the input data points, i.e., the output function is of the form $f_{\theta,L}(x) = \sum_i w_{L,i} \left( s_i (x - x_i) \right)_+$.
Therefore, the network output is a linear spline interpolation for one-dimensional datasets.

Proof of Corollaries 4.1 and 4.2. Let us particularly consider the input sample $a_0$. Then, the activations of the network defined by equation 44 and equation 46 satisfy $a_{L-1} = (a_{L-2}^T w_{L-1} + b)_+ = (\|a_{L-2}\|_2 - \|a_{L-2}\|_2)_+ = 0$. Thus, if we feed $c_i a_0$ to the network, we get $a_{L-1} = (c_i \|a_{L-2}\|_2 - c_i \|a_{L-2}\|_2)_+ = 0$, where we use the fact that the optimal biases are of the form $b = -c_i \|a_{L-2}\|_2$ as proved in equation 46. This analysis proves that the kink of each ReLU activation occurs exactly at one of the data points. □

The objective in equation 50 is then maximized by $\Lambda^* = t^{*L-2}\, Y/\sigma_{\max}(Y)$, which is also a feasible solution. Therefore, equation 50 can be equivalently written as
$$\arg\max_{\{\theta_l\}_{l=1}^{L-1} \in \Theta_{L-1}} \|Y^T (A_{L-2} w_{L-1})_+\|_2. \tag{51}$$
We now note that since $Y$ has orthogonal columns, equation 51 can be decomposed into $k$ maximization problems, each of which can be maximized independently to find a set of extreme points. In particular, the $j^{th}$ problem can be formulated as
$$\arg\max_{\{\theta_l\}_{l=1}^{L-1} \in \Theta_{L-1}} |y_j^T (A_{L-2} w_{L-1})_+| \le \max\left\{ \|(y_j)_+\|_2,\ \|(-y_j)_+\|_2 \right\}.$$
Then, noting the whitened data assumption, the rest of the steps directly follow Theorem 4.1; in particular, the activations propagate as
$$a_1^T = (a_0^T W_1)_+ = \left( a_0^T \frac{a_0}{\|a_0\|_2} \phi_1^T \right)_+ = \|a_0\|_2 \phi_1^T, \quad a_2^T = (a_1^T W_2)_+ = \|a_0\|_2 \|\phi_1\|_2 \phi_2^T, \quad \ldots$$
This yields the following weight matrices:
$$W_l = \frac{1}{\sqrt{2}} \sum_{r=1}^{2} \frac{\phi_{l-1,r}}{\|\phi_{l-1,r}\|_2} \phi_{l,r}^T,\ \forall l \in [L-2], \qquad W_{L-1} = \frac{1}{\sqrt{2}} \begin{bmatrix} \frac{\phi_{L-2,1}}{\|\phi_{L-2,1}\|_2} & \frac{\phi_{L-2,2}}{\|\phi_{L-2,2}\|_2} \end{bmatrix},$$
where $\phi_{0,r} = X^T (\pm y_j)_+$ and $\{\phi_{l,r}\}_{l=1}^{L-2}$ is a set of nonnegative vectors satisfying $\|\phi_{l,r}\|_2 = t^*$ and $\phi_{l,1}^T \phi_{l,2} = 0, \forall l$. Thus, combining the extreme points for each $j$ yields
$$W_l = \frac{1}{\sqrt{2K}} \sum_{r=1}^{2K} \frac{\phi_{l-1,r}}{\|\phi_{l-1,r}\|_2} \phi_{l,r}^T,\ \forall l \in [L-2], \qquad W_{L-1} = \frac{1}{\sqrt{2K}} \begin{bmatrix} \frac{\phi_{L-2,1}}{\|\phi_{L-2,1}\|_2} & \cdots & \frac{\phi_{L-2,2K}}{\|\phi_{L-2,2K}\|_2} \end{bmatrix},$$
where $(\phi_{0,2j-1}, \phi_{0,2j}) = \left( X^T (y_j)_+,\ X^T (-y_j)_+ \right), \forall j \in [K]$, and $\{\phi_{l,r}\}_{l=1}^{L-2}$ is a set of nonnegative vectors satisfying $\|\phi_{l,r}\|_2 = t^*$ and $\phi_{l,i}^T \phi_{l,j} = 0,\ \forall i \neq j$.
Moreover, as a direct consequence of Theorem 3.4, strong duality holds for deep ReLU networks. The corresponding extreme points of the constraint are
$$W_l = \sum_{r=1}^{2K} \frac{\phi_{l-1,r}}{\|\phi_{l-1,r}\|_2} \phi_{l,r}^T, \quad \text{if } 1 \le l \le L-1, \tag{53}$$
where the definitions follow from Theorem 4.3. We now note that, given the hidden layer weights in equation 53, the primal problem in equation 17 is convex and differentiable with respect to the output layer weight $W_L$. Thus, we can find the optimal output layer weights by simply taking the derivative and setting it to zero. Applying these steps yields the following output layer weights:
$$W_L = \sum_{r=1}^{2K} \left( \|\phi_{0,r}\|_2 - \beta \right) \phi_{L-1,r}\, \hat{e}_r^T,$$
where $\hat{e}_{2j-1} = \hat{e}_{2j} = e_j, \forall j \in [K]$, and $e_j$ is the $j^{th}$ ordinary basis vector. □
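Corollaries 4.1 and 4.2 above assert that the optimal 1-D network is piecewise linear with kinks only at the data points. The following sketch builds such a network explicitly (taking all $s_i = +1$ and illustrative output weights, both made-up choices) and checks that its second difference vanishes everywhere except at the data points:

```python
import numpy as np

rng = np.random.default_rng(6)
x_data = np.sort(rng.uniform(-0.9, 0.9, 5))      # 1-D input data points x_i
w_out = rng.standard_normal(5)                   # illustrative output weights

# Each optimal neuron from Corollary 4.1 (taking s_i = +1) is (x - x_i)_+,
# so the network output is a sum of ramps with kinks at the data points.
def f(x):
    return sum(w * np.maximum(x - xi, 0.0) for w, xi in zip(w_out, x_data))

grid = np.linspace(-1.0, 1.0, 2001)
h = grid[1] - grid[0]
second = np.diff(f(grid), 2)                     # discrete curvature on the grid
near_kink = np.any(np.abs(grid[1:-1][:, None] - x_data) <= h + 1e-12, axis=1)

# zero second difference away from the x_i: the output is piecewise linear,
# i.e., a linear spline with knots only at the data points
assert np.allclose(second[~near_kink], 0.0, atol=1e-9)
print("kinks only at the data points: the output is a linear spline")
```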



The proof is presented in Appendix A.3.
For the definitions and details, see Appendix A.1.
In this paper, we use the full SVD unless otherwise stated.
This can be achieved by applying batch whitening, which often improves accuracy (Huang et al., 2018).
Additional numerical results can be found in Appendix A.2.
With this assumption, $(L-2)t^2$ becomes constant, so we ignore this term for the rest of our derivations.



(Table: comparison with prior work, e.g., Savarese et al. (2019), in terms of network width ($m$), depth ($L$), and vector outputs ($K$).)

Figure: Verification of Remark 2.1. (a) Rank of the hidden layer weight matrix as a function of $\beta$ and (b) rank of the hidden layer weights for different regularization parameters, i.e., $\beta_1 < \beta_2 < \beta_3 < \beta_4$.
Figure: Verification of Propositions 3.1 and 4.1. (a) Evolution of the operator and Frobenius norms of the layer weights of a linear network and (b) rank of the layer weights of a ReLU network with $K = 1$.

Figure: (a) Projection of the hidden neurons onto the right singular vectors claimed in Remark 2.1 and (b) singular values of $W_1$ with respect to $\beta$.

Lemma A.1. [Neyshabur et al. (2014);Savarese et al. (2019);Ergen & Pilanci (2020a;b)] The following two problems are equivalent:

A.4.2 REGULARIZED TRAINING PROBLEM FOR DEEP LINEAR NETWORKS WITH VECTOR OUTPUT

Using Lemma A.4 and Proposition 3.1, we have the following dual for the regularized version of equation 15.


$$\max_{\lambda}\ \lambda^T y \quad \text{s.t. } \left|\left(X W_1 \cdots w_{L-1}\right)^T \lambda\right| \le 1,\ \forall \theta_l \in \Theta^{L-1} \;=\; \max_{\lambda}\ \lambda^T y \quad \text{s.t. } \left|\left(X W_1^e \cdots w_{L-1}^e\right)^T \lambda\right| \le 1 \;=\; D_e^* = D^* = D_m^*$$

i.e., $P_e^* = D_e^*$, and we have $P_e^* \ge P^* \ge P_m^* \ge D_e^* = D^* = D_m^*$. Using this result along with equation 33, we prove that strong duality holds, i.e., $P_e^* = P^* = P_m^* = D_e^* = D^* = D_m^*$.

i.e., $P_e^* = D_e^*$, and we have $P_e^* \ge P^* \ge P_m^* \ge D_e^* = D^* = D_m^*$. Using this result along with equation 35, we prove that strong duality holds, i.e., $P_e^* = P^* = P_m^* = D_e^* = D^* = D_m^*$.

$$= \left(a_{L-3}^T W_{L-2}\right)_+ = \left(a_{L-3}^T \frac{a_{L-3}}{\|a_{L-3}\|_2}\, \phi_{L-2}^T\right)_+ = \|a_0\|_2\, \|\phi_1\|_2 \cdots \|\phi_{L-3}\|_2\, \phi_{L-2}^T$$
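The key step above is that for a rank-one layer $W = \frac{a}{\|a\|_2}\phi^T$ with $\phi \ge 0$, the ReLU is inactive and the activation norms simply multiply: $(a^T W)_+ = \|a\|_2\, \phi^T$. A quick numerical check (toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.standard_normal(5)
phi = np.abs(rng.standard_normal(3))   # phi >= 0, as in the extreme points

# Rank-one layer W = (a / ||a||_2) phi^T, so a^T W = ||a||_2 phi^T >= 0
# and the ReLU passes it through unchanged.
W = np.outer(a / np.linalg.norm(a), phi)
out = np.maximum(a @ W, 0.0)

assert np.allclose(out, np.linalg.norm(a) * phi)
```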

Theorem 4.4. Let $\{X, Y\}$ be a dataset such that $X^T X = I_n$ and $Y$ has orthogonal columns. Then a set of optimal layer weight matrices for the following regularized training problem is given by
$$W_l = \begin{cases} \sum_{r=1}^{2K} \frac{\phi_{l-1,r}}{\|\phi_{l-1,r}\|_2}\, \phi_{l,r}^T & \text{if } 1 \le l \le L-1 \\[4pt] \sum_{r=1}^{2K} \left(\|\phi_{0,r}\|_2 - \beta\right) \phi_{l-1,r}\, \hat{e}_r^T & \text{if } l = L, \end{cases}$$
where $\hat{e}_{2j-1} = \hat{e}_{2j} = e_j$, $\forall j \in [K]$, $e_j$ is the $j$th ordinary basis vector, and the other definitions follow from Theorem 4.3, except $t^* = 1$.

Proof of Theorem 4.4. From Section A.4.2 and the proof of Theorem 4.3, the dual problem has a closed-form solution as follows
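One concrete (though not unique) family satisfying the conditions on the intermediate vectors $\{\phi_{l,r}\}$ — nonnegative entries, norm $t^*$, and pairwise orthogonality — uses disjoint supports; a minimal sketch with hypothetical sizes:

```python
import numpy as np

m, t_star, r = 6, 1.0, 4   # hypothetical layer width, norm t* (= 1 here), r <= m vectors

# Disjoint-support construction: scaled standard basis vectors are
# nonnegative, have norm t*, and are pairwise orthogonal.
phis = [t_star * np.eye(m)[i] for i in range(r)]

for i in range(r):
    assert np.all(phis[i] >= 0) and np.isclose(np.linalg.norm(phis[i]), t_star)
    for j in range(i + 1, r):
        assert np.isclose(phis[i] @ phis[j], 0.0)
```

Orthogonality with nonnegativity forces disjoint supports, which is why the layer width $m_l$ must be at least the number of such vectors.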

The proofs are presented in Appendices A.5, A.6, and A.7 for two-layer, deep linear, and deep ReLU networks, respectively.


Proposition 4.2. Strong duality also holds for deep ReLU networks with vector outputs, and the optimal layer weights can be formulated as in Theorem 4.1.

Proof of Proposition 4.2. For vector outputs, we have the following training problem

After a suitable rescaling as in the previous case, the above problem has the following dual

Using equation 47, we can characterize the optimal layer weights as the extreme points that solve

$$\arg\max$$

where $\Lambda^*$ is the optimal dual parameter. Since we assume that $X = c\, a_0^T$ with $c \in \mathbb{R}^n_+$, we have

Based on this observation, we have $w_{L-1} = a_{L-2}/\|a_{L-2}\|_2$, which reduces equation 48 to the following

$$\arg\max$$

Then, the rest of the steps directly follow Theorem 4.1, yielding the following weight matrices

where

Moreover, as a direct consequence of Theorem 3.4, strong duality holds for deep ReLU networks.

Theorem 4.3. Let $\{X, Y\}$ be a dataset such that $XX^T = I_n$ and $Y$ has orthogonal columns. Then, the optimal weight matrices for each layer can be formulated as follows

where $(\phi_{0,2j-1}, \phi_{0,2j}) = \left(\left(X^T y_j\right)_+, \left(-X^T y_j\right)_+\right)$, $\forall j \in [K]$, and $\{\phi_{l,r}\}_{l=1}^{L-2}$ is a set of vectors such that $\phi_{l,r} \in \mathbb{R}^{m_l}_+$, $\|\phi_{l,r}\|_2 = t^*$, and $\phi_{l,i}^T \phi_{l,j} = 0$, $\forall i \ne j$.

Proof of Theorem 4.3. For vector outputs, we have the following training problem

After a suitable rescaling as in the previous case, the above problem has the following dual

Using equation 49, we can characterize the optimal layer weights as the extreme points that solve

$$\arg\max$$

where $\Lambda^*$ is the optimal dual parameter. We first note that since $X$ is whitened such that $XX^T = I_n$, equation 50 implies $\sigma_{\max}(\Lambda^*) \le {t^*}^{L-2}$. Then, the objective is trivially maximized by $\Lambda^* =$
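The first-layer extreme points in Theorem 4.3 split $X^T y_j$ into its positive and negative parts; a small numerical sketch (toy whitened data and orthogonal labels, all sizes hypothetical) confirms the two basic properties of this construction:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, K = 5, 8, 2
U, _, Vt = np.linalg.svd(rng.standard_normal((n, d)), full_matrices=False)
X = U @ Vt                                          # whitened: X X^T = I_n
Y, _ = np.linalg.qr(rng.standard_normal((n, K)))    # orthogonal columns

# Pair construction from Theorem 4.3:
# phi_{0,2j-1} = (X^T y_j)_+ and phi_{0,2j} = (-X^T y_j)_+.
phis = []
for j in range(K):
    z = X.T @ Y[:, j]
    phis += [np.maximum(z, 0.0), np.maximum(-z, 0.0)]

# Each pair has disjoint supports and recovers X^T y_j as a difference.
for j in range(K):
    assert np.isclose(phis[2 * j] @ phis[2 * j + 1], 0.0)
    assert np.allclose(phis[2 * j] - phis[2 * j + 1], X.T @ Y[:, j])
```

The disjoint supports of each $(\phi_{0,2j-1}, \phi_{0,2j})$ pair are what allow the positive and negative parts of the label direction to be routed through separate ReLU neurons.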

