PATH REGULARIZATION: A CONVEXITY AND SPARSITY INDUCING REGULARIZATION FOR PARALLEL RELU NETWORKS

Abstract

Understanding the fundamental principles behind the success of deep neural networks is one of the most important open questions in the current literature. To this end, we study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape. We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases. We then show that pathwise regularized training problems can be represented as an exact convex optimization problem. We further prove that the equivalent convex problem is regularized via a group sparsity inducing norm. Thus, a path regularized parallel ReLU network can be viewed as a parsimonious convex model in high dimensions. More importantly, since the original training problem may not be trainable in polynomial time, we propose an approximate algorithm with fully polynomial-time complexity in all data dimensions. We then prove strong global optimality guarantees for this algorithm and provide experiments corroborating our theory.

(a) 2-layer NN with WD; (b) 3-layer NN with WD; (c) 3-layer NN with PR (Ours)

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved substantial improvements in several fields of machine learning. However, since DNNs have a highly nonlinear and non-convex structure, the fundamental principles behind their remarkable performance are still an open problem. Therefore, advances in this field largely depend on heuristic approaches. One of the most prominent techniques to boost the generalization performance of DNNs is regularizing the layer weights so that the network fits a function that performs well on unseen test data. Even though weight decay, i.e., penalizing the squared $\ell_2$-norm of the layer weights, is commonly employed as a regularization technique in practice, it has recently been shown that the $\ell_2$-path regularizer (Neyshabur et al., 2015b), i.e., the sum over all paths in the network of the squared product over all weights in the path, achieves further empirical gains (Neyshabur et al., 2015a). Therefore, in this paper, we investigate the underlying mechanisms behind path regularized DNNs through the lens of convex optimization.

Table 1: Complexity comparison with prior works ($n$: # of data samples, $d$: feature dimension, $m_l$: # of hidden neurons in layer $l$, $\epsilon$: approximation accuracy, $r$: rank of the data, chosen according to (10) such that $\epsilon$ accuracy is achieved).

2. PARALLEL NEURAL NETWORKS

Although DNNs are highly complex architectures due to the composition of multiple nonlinear functions, their parameters are often trained via simple first-order gradient based algorithms, e.g., Gradient Descent (GD) and its variants. However, since such algorithms only rely on the local gradient of the objective function, they may fail to globally optimize the objective in certain cases (Shalev-Shwartz et al., 2017; Goodfellow et al., 2016). Similarly, Ge et al. (2017); Safran & Shamir (2018) showed that these pathological cases also apply to stochastic algorithms such as Stochastic GD (SGD).
They further show that some of these issues can be avoided by increasing the number of trainable parameters, i.e., operating in an overparameterized regime. However, Anandkumar & Ge (2016) reported the existence of more complicated cases where SGD/GD usually fails. Therefore, training DNNs to global optimality remains a challenging optimization problem (DasGupta et al., 1995; Blum & Rivest, 1989; Bartlett & Ben-David, 1999). To circumvent difficulties in training, recent studies focused on models that benefit from overparameterization (Brutzkus et al., 2017; Du & Lee, 2018; Arora et al., 2018b; Neyshabur et al., 2018). As an example, Wang et al. (2021); Ergen & Pilanci (2021d;c); Zhang et al. (2019); Haeffele & Vidal (2017) considered a new architecture that combines multiple standard NNs, termed sub-networks, in parallel. Evidence in Ergen & Pilanci (2021d;c); Zhang et al. (2019); Haeffele & Vidal (2017); Zagoruyko & Komodakis (2016); Veit et al. (2016) showed that this way of combining NNs yields an optimization landscape with fewer local minima and/or saddle points, so that SGD/GD generally converges to a global minimum. Therefore, many recently proposed NN-based architectures that achieve state-of-the-art performance in practice, e.g., SqueezeNet (Iandola et al., 2016), Inception (Szegedy et al., 2017), Xception (Chollet, 2017), and ResNext (Xie et al., 2017), are in this form.

Notation and preliminaries: Throughout the paper, we denote matrices and vectors as uppercase and lowercase bold letters, respectively. For vectors and matrices, we use subscripts to denote a certain column/element. As an example, $w_{lkj_{l-1}j_l}$ denotes the $(j_{l-1}, j_l)^{th}$ entry of the matrix $W_{lk}$. We use $I_k$ and $0$ (or $1$) to denote the identity matrix of size $k \times k$ and a vector/matrix of zeros (or ones) of appropriate size. We use $[n]$ for the set of integers from 1 to $n$, and $\|\cdot\|_2$ and $\|\cdot\|_F$ to represent the Euclidean and Frobenius norms, respectively.
Additionally, we denote the unit $\ell_p$ ball as $\mathcal{B}_p := \{u \in \mathbb{R}^d : \|u\|_p \le 1\}$. We also use $\mathbb{1}[x \ge 0]$ and $(x)_+ = \max\{x, 0\}$ to denote the 0-1 valued indicator and ReLU, respectively.

In this paper, we particularly consider a parallel ReLU network with $K$ sub-networks, where each sub-network is an $L$-layer ReLU network (see Figure 2) with layer weights $W_{lk} \in \mathbb{R}^{m_{l-1} \times m_l}, \forall l \in [L-1]$, and $w_{Lk} \in \mathbb{R}^{m_{L-1}}$, where $m_0 = d$, $m_L = 1$, and $m_l$ denotes the number of neurons in the $l^{th}$ hidden layer. Then, given a data matrix $X \in \mathbb{R}^{n \times d}$, the output of the network is

$$f_\theta(X) := \sum_{k=1}^{K} \big( (XW_{1k})_+ \cdots W_{(L-1)k} \big)_+ w_{Lk},$$

where we compactly denote the parameters as $\theta := \cup_k \{W_{lk}\}_{l=1}^{L}$ with parameter space $\Theta$, and each sub-network represents a standard deep ReLU network.

Remark 1. Most commonly used neural networks in practice can be classified as special cases of parallel networks, e.g., standard NNs and ResNets (He et al., 2016); see Appendix A.2 for details.
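As a concrete illustration, the forward pass of this parallel architecture can be sketched in a few lines (a minimal NumPy sketch with assumed toy dimensions; the weight shapes follow the definition above):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def parallel_relu_forward(X, subnets):
    """Output of a parallel ReLU network: the sum of K sub-network outputs.

    `subnets` is a list of K sub-networks; each sub-network is a list
    [W_1, ..., W_{L-1}, w_L] of hidden weight matrices followed by the
    output weight vector (the paper's setting has m_0 = d and m_L = 1).
    """
    out = np.zeros(X.shape[0])
    for sub in subnets:
        *hidden, w_last = sub
        A = X
        for W in hidden:
            A = relu(A @ W)   # (X W_1)_+ ... applied layer by layer
        out += A @ w_last     # linear output layer of sub-network k
    return out

# Toy sizes, assumed for illustration only.
rng = np.random.default_rng(0)
n, d, m1, m2, K = 5, 3, 4, 2, 2
X = rng.standard_normal((n, d))
theta = [[rng.standard_normal((d, m1)),
          rng.standard_normal((m1, m2)),
          rng.standard_normal(m2)] for _ in range(K)]
y_hat = parallel_relu_forward(X, theta)   # one scalar output per sample
```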

2.1. OUR CONTRIBUTIONS

• We prove that training the path regularized parallel ReLU network (1) is equivalent to a convex optimization problem that can be approximately solved in polynomial time by standard convex solvers (see Table 1). Therefore, we generalize the two-layer results in Pilanci & Ergen (2020) to multiple nonlinear layers without any strong assumptions, in contrast to Ergen & Pilanci (2021c), and to a much broader class of NN architectures including ResNets.

• As already observed by Pilanci & Ergen (2020); Ergen & Pilanci (2021c), regularized deep ReLU network training problems require exponential-time complexity when the data matrix is full rank, which is unavoidable. However, in this paper, we develop an approximate training algorithm with fully polynomial-time complexity in all data dimensions and prove global optimality guarantees in Theorem 2. To the best of our knowledge, this is the first convex optimization based training algorithm for ReLU networks with polynomial-time complexity (in data dimensions) and global approximation guarantees.

• We show that the equivalent convex problem is regularized by a group norm regularization whose grouping effect is among the sub-networks. Therefore, the equivalent convex formulation reveals an implicit regularization that promotes group sparsity among sub-networks and generalizes prior works on linear networks, such as Dai et al. (2021), to ReLU networks.

• We derive a closed-form mapping between the parameters of the non-convex parallel ReLU network and its convex equivalent in Proposition 1. Therefore, instead of solving the challenging non-convex problem, one can globally solve the equivalent convex problem and then construct an optimal solution to the original non-convex network architecture via our closed-form mapping.

2.2. OVERVIEW OF OUR RESULTS

Given data $X \in \mathbb{R}^{n \times d}$ and labels $y \in \mathbb{R}^n$, we consider the following regularized training problem

$$p_L^* := \min_{\theta \in \Theta} \mathcal{L}(f_\theta(X), y) + \beta \mathcal{R}(\theta),$$

where $\Theta := \{\theta : W_{lk} \in \mathbb{R}^{m_{l-1} \times m_l}, \forall l \in [L], \forall k \in [K]\}$ is the parameter space, $\mathcal{L}(\cdot, \cdot)$ is an arbitrary convex loss function, $\mathcal{R}(\cdot)$ represents the regularization on the network weights, and $\beta > 0$ is the regularization coefficient. For the rest of the paper, we focus on a scalar output regression/classification framework with arbitrary loss functions, e.g., squared loss, cross entropy, or hinge loss. We also note that our derivations can be straightforwardly extended to vector output networks, as proven in Appendix A.11. More importantly, we use the $\ell_2$-path regularizer studied in Neyshabur et al. (2015b;a), which is defined as

$$\mathcal{R}(\theta) := \sum_{k=1}^{K} \sqrt{\sum_{j_1, j_2, \ldots, j_L} \|w_{1kj_1}\|_2^2 \prod_{l=2}^{L} w_{lkj_{l-1}j_l}^2},$$

where $w_{lkj_{l-1}j_l}$ is the $(j_{l-1}, j_l)^{th}$ entry of $W_{lk}$. The above regularizer sums the squares of all the parameters along each possible path from input to output of each sub-network $k$ (see Figure 2) and then takes the square root of the summation. Therefore, we penalize each path in each sub-network and then group the paths based on the sub-network index $k$. We now propose a scaling to show that (2) is equivalent to a group $\ell_1$ regularized problem.

Lemma 1. The following problems are equivalent:

$$\min_{\theta \in \Theta} \mathcal{L}(f_\theta(X), y) + \beta \mathcal{R}(\theta) = \min_{\theta \in \Theta_s} \mathcal{L}(f_\theta(X), y) + \beta \sum_{k=1}^{K} \|w_{Lk}\|_2,$$

where $w_{Lk} \in \mathbb{R}^{m_{L-1}}$ are the last layer weights of each sub-network $k$, and

$$\Theta_s := \Big\{\theta \in \Theta : \sum_{j_1, \ldots, j_{L-2}} \|w_{1kj_1}\|_2^2 \prod_{l=2}^{L-1} w_{lkj_{l-1}j_l}^2 \le 1, \; \forall j_{L-1}, \forall k \in [K]\Big\}$$

denotes the parameter space after rescaling. The advantage of the form in Lemma 1 is that we can derive a dual problem with respect to the output layer weights $w_{Lk}$ and then characterize the optimal layer weights via optimality conditions and prior work on $\ell_1$ regularization in infinite dimensional spaces (Rosset et al., 2007).
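The rescaling behind Lemma 1 can be checked numerically for a single three-layer sub-network: normalizing the paths into each second-layer neuron and pushing the scale to the output weights leaves the network output unchanged and turns the path regularizer into the norm of the rescaled output weights (a small sketch with assumed toy dimensions):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0)
rng = np.random.default_rng(1)
n, d, m1, m2 = 6, 3, 4, 2   # toy sizes for one sub-network (L = 3)
X = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, m1))
W2 = rng.standard_normal((m1, m2))
w3 = rng.standard_normal(m2)

def output(W1, W2, w3):
    return relu(relu(X @ W1) @ W2) @ w3

# l2-path regularizer for one sub-network:
# sqrt( sum_{j1,j2} ||w_{1 j1}||_2^2 * w_{2 j1 j2}^2 * w_{3 j2}^2 )
col_norms = np.sum(W1 ** 2, axis=0)                  # ||w_{1 j1}||_2^2
R = np.sqrt(np.einsum('i,ij,j->', col_norms, W2 ** 2, w3 ** 2))

# Lemma 1 rescaling: normalize paths into each neuron j2, push scale to w3.
r = np.sqrt(col_norms @ (W2 ** 2))                   # r_{j2} per 2nd-layer neuron
W2s, w3s = W2 / r, w3 * r
```

Since ReLU is positively homogeneous, `output(W1, W2s, w3s)` matches `output(W1, W2, w3)`, while `R` equals the plain Euclidean norm of the rescaled output weights `w3s`, i.e., the group-$\ell_1$ penalty of Lemma 1.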
Thus, we first apply the rescaling in Lemma 1 and then take the dual with respect to the output weights $w_{Lk}$. To characterize the hidden layer weights, we then change the order of the minimization over the hidden layer weights and the maximization over the dual parameter to get the following dual problem

$$p_L^* \ge d_L^* := \max_{v} -\mathcal{L}^*(v) \quad \text{s.t.} \quad \max_{\theta \in \Theta_s} \big\| v^T \big( (XW_1)_+ \cdots W_{(L-1)} \big)_+ \big\|_2 \le \beta,$$

where $\mathcal{L}^*$ is the Fenchel conjugate of $\mathcal{L}$, defined as (Boyd & Vandenberghe, 2004)

$$\mathcal{L}^*(v) := \max_{z} z^T v - \mathcal{L}(z, y).$$

The dual problem in (3) is critical for our derivations since it provides an analytic perspective to characterize a set of optimal hidden layer weights for the non-convex neural network in (1). To do so, we first show that strong duality holds for the non-convex training problem in Lemma 1, i.e., $p_L^* = d_L^*$. Then, based on the exact dual problem in (3), we propose an equivalent analytic description of the optimal hidden layer weights via the KKT conditions. We note that strong duality for two-layer ReLU networks has already been proved by previous studies (Wang et al., 2021; Ergen & Pilanci, 2021c; Pilanci & Ergen, 2020; Ergen & Pilanci, 2021b; Zhang et al., 2019; Bach, 2017); however, this is the first work providing an exact characterization for path regularized deep ReLU networks via convex duality.
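As a concrete instance of the Fenchel conjugate above, consider the squared loss (a standard convex-analysis computation, not stated in this form in the paper):

```latex
% Fenchel conjugate of the squared loss L(z, y) = (1/2) ||z - y||_2^2:
\mathcal{L}^*(v) = \max_z \Big\{ z^T v - \tfrac{1}{2}\|z - y\|_2^2 \Big\}
                 = v^T y + \tfrac{1}{2}\|v\|_2^2,
% attained at z = y + v (set the gradient v - (z - y) to zero).
```

Hence, for the squared loss the dual objective $-\mathcal{L}^*(v) = -v^T y - \tfrac{1}{2}\|v\|_2^2$ in (3) is a concave quadratic in $v$.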

3. PARALLEL NETWORKS WITH THREE LAYERS

Here, we consider a three-layer parallel network with $K$ sub-networks, which is a special case of (1) with $L = 3$. Thus, we have the following training problem

$$p_3^* = \min_{\theta \in \Theta} \mathcal{L}(f_\theta(X), y) + \beta \sum_{k=1}^{K} \sqrt{\sum_{j_1, j_2} \|w_{1kj_1}\|_2^2 \, w_{2kj_1j_2}^2 \, w_{3kj_2}^2}, \tag{4}$$

where $\Theta = \{(W_{1k}, W_{2k}, w_{3k}) : W_{1k} \in \mathbb{R}^{d \times m_1}, W_{2k} \in \mathbb{R}^{m_1 \times m_2}, w_{3k} \in \mathbb{R}^{m_2}\}$. By Lemma 1,

$$p_3^* = \min_{\theta \in \Theta_s} \mathcal{L}(f_\theta(X), y) + \beta \sum_{k=1}^{K} \|w_{3k}\|_2. \tag{5}$$

Then, taking the dual of (5) with respect to the output layer weights $w_{3k} \in \mathbb{R}^{m_2}$ and changing the order of the minimization over $\{W_{1k}, W_{2k}\}$ and the maximization over the dual variable $v$ yields

$$p_3^* \ge d_3^* := \max_{v} -\mathcal{L}^*(v) \quad \text{s.t.} \quad \max_{\theta \in \Theta_s} \big\| v^T \big( (XW_1)_+ W_2 \big)_+ \big\|_2 \le \beta. \tag{6}$$

Figure 3: Here, we have three samples in two dimensions and we want to separate these samples with a linear classifier. $D_i$ basically encodes information regarding which side of the linear classifier the samples lie on.

Since (4) is non-convex with respect to the layer weights, we may have a duality gap, i.e., $p_3^* \ge d_3^*$. Therefore, we first show that strong duality holds in this case, i.e., $p_3^* = d_3^*$, as detailed in Appendix A.4. We then introduce an equivalent representation for the ReLU activation. Since ReLU masks the negative entries of its input, we have

$$\big( (XW_1)_+ w_2 \big)_+ = \Big( \sum_{j_1=1}^{m_1} (Xw_{1j_1})_+ w_{2j_1} \Big)_+ = \Big( \sum_{j_1=1}^{m_1} (X\bar{w}_{j_1})_+ I_{j_1} \Big)_+ = D_2 \sum_{j_1=1}^{m_1} I_{j_1} D_{1j_1} X \bar{w}_{j_1},$$

where $\bar{w}_{j_1} = |w_{2j_1}| w_{1j_1}$, $I_{j_1} = \mathrm{sign}(w_{2j_1}) \in \{-1, +1\}$, and we use the following alternative representation for ReLU (see Figure 3 for a two dimensional visualization)

$$(Xw)_+ = DXw \iff \begin{cases} DXw \ge 0 \\ (I_n - D)Xw \le 0 \end{cases} \iff (2D - I_n)Xw \ge 0, \tag{7}$$

where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix of zeros/ones, i.e., $D_{ii} \in \{0, 1\}$. Therefore, we first enumerate all possible signs and diagonal matrices for the ReLU layers and denote them as $I_{j_1}$, $D_{1ij_1}$, and $D_{2l}$, respectively, where $j_1 \in [m_1]$, $i \in [P_1]$, $l \in [P_2]$.
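The alternative ReLU representation in (7) is easy to verify numerically: forming the diagonal mask $D$ from the sign pattern of $Xw$ reproduces the ReLU output, and $(2D - I_n)Xw \ge 0$ holds by construction (a small sketch with assumed toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 2
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)

# D encodes which side of the hyperplane {x : x^T w = 0} each sample lies on.
D = np.diag((X @ w >= 0).astype(float))

relu_out = np.maximum(X @ w, 0)   # (Xw)_+
masked = D @ (X @ w)              # D X w, identical for this D
```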
Here, $D_{1ij_1}$ and $D_{2l}$ denote the masking/diagonal matrices for the first and second ReLU layers, respectively, and $P_1$ and $P_2$ are the corresponding total numbers of diagonal matrices in each layer, as detailed in Section A.10. Then, we convert the non-convex dual constraints in (6) to a convex constraint using $D_{1ij_1}$, $D_{2l}$, and $I_{j_1}$. Using the representation in (7), we then take the dual of (6) to obtain the convex bidual of the primal problem (4), as detailed in the theorem below.

Theorem 1. The non-convex training problem in (4) can be cast as the following convex program

$$\min_{z, z' \in \mathcal{C}} \mathcal{L}\big( \bar{X}(z - z'), y \big) + \beta \sqrt{m_2} \left( \|z\|_{F,1} + \|z'\|_{F,1} \right), \tag{8}$$

where $\|\cdot\|_{F,1}$ denotes a $d \times m_1$ dimensional group Frobenius norm operator such that, given a vector $u \in \mathbb{R}^{dm_1P}$, $\|u\|_{F,1} := \sum_{i=1}^{P} \|U_i\|_F$, where $U_i \in \mathbb{R}^{d \times m_1}$ are reshaped partitions of $u$. Moreover, the convex set $\mathcal{C}$ is defined as

$$\mathcal{C} := \{z : z_{ij_1l}^s \in \mathcal{C}_{il}^s, \; \forall i \in [P_1], l \in [P_2], s \in [M]\}$$

$$\mathcal{C}_{il}^s := \Big\{ \{w_{j_1}\}_{j_1} : (2D_{2l} - I_n) \sum_{j_1=1}^{m_1} I_{ij_1l}^s D_{1ij_1} X w_{j_1} \ge 0, \;\; (2D_{1ij_1} - I_n) X w_{j_1} \ge 0, \; \forall j_1 \in [m_1] \Big\},$$

where $I_{ij_1l}^s \in \{+1, -1\}$, $M = 2^{m_1}$, and $z, z' \in \mathbb{R}^{dm_1MP_1P_2}$ are constructed by stacking $z_{ij_1l}^s, z_{ij_1l}'^s \in \mathbb{R}^d$, $\forall i \in [P_1], l \in [P_2], j_1 \in [m_1], s \in [M]$, respectively. Also, the effective data matrix $\bar{X} \in \mathbb{R}^{n \times dm_1MP_1P_2}$ is defined as

$$\bar{X} := I_M \otimes \bar{X}_s, \quad \bar{X}_s := [D_{21}D_{111}X \;\; \ldots \;\; D_{2l}D_{1ij_1}X \;\; \ldots \;\; D_{2P_2}D_{1P_1m_1}X].$$

We next derive a mapping between the convex program (8) and the non-convex architecture (4).

Proposition 1. An optimal solution to the non-convex parallel network training problem in (4), denoted as $\{W_{1k}^*, w_{2k}^*, w_{3k}^*\}_{k=1}^K$, can be recovered from an optimal solution to the convex program in (8), i.e., $\{z^*, z'^*\}$, via a closed-form mapping.

Therefore, we prove a mapping between the parameters of the parallel network in Figure 2 and its convex equivalent. Next, we prove that the convex program in (8) can be globally optimized with polynomial-time complexity given that $X$ has fixed rank, i.e., $\mathrm{rank}(X) = r < \min\{n, d\}$.

Proposition 2.
Given a data matrix such that $\mathrm{rank}(X) = r < \min\{n, d\}$, the convex program in (8) can be globally optimized via standard convex solvers with $\mathcal{O}(d^3 m_1^3 m_2^3 2^{3(m_1+1)m_2} n^{3(m_1+1)r})$ complexity, which is polynomial in $n$ and $d$.

Note that here globally optimizing the training objective means achieving the exact global minimum up to arbitrary machine precision or solver tolerance. Below, we show that the complexity analysis in Proposition 2 extends to arbitrarily deep networks.

Corollary 1. The same analysis can be readily applied to arbitrarily deep networks. Therefore, given $\mathrm{rank}(X) = r < \min\{n, d\}$, we prove that $L$-layer architectures can be globally optimized with

$$\mathcal{O}\Big( d^3 \prod_{j=1}^{L-2} m_j^3 \; 2^{3\sum_{j=1}^{L-1} m_j} \; n^{3r\left(1 + \sum_{l=1}^{L-2} \prod_{j=1}^{l} m_j\right)} \Big)$$

complexity, which is polynomial in $n$ and $d$.

3.1. POLYNOMIAL-TIME TRAINING FOR ARBITRARY DATA

Based on the analysis in Corollary 1, exponential complexity is unavoidable for deep networks when the data matrix is full rank, i.e., $\mathrm{rank}(X) = \min\{n, d\}$. Thus, we propose a low rank approximation to the model in (4). We first denote the rank-$r$ approximation of $X$ as $\hat{X}_r$ such that $\|X - \hat{X}_r\|_2 \le \sigma_{r+1}$, where $\sigma_{r+1}$ represents the $(r+1)^{th}$ largest singular value of $X$. Then, we have the following result.

Theorem 2. Given an $R$-Lipschitz convex loss function $\mathcal{L}(\cdot, y)$, the regularized training problem

$$p_3^* = \min_{\theta \in \Theta} \mathcal{L}(f_\theta(X), y) + \beta \sum_{k=1}^{K} \sqrt{\sum_{j_1, j_2} \|w_{1kj_1}\|_2^2 \, w_{2kj_1j_2}^2 \, w_{3kj_2}^2}$$

can be solved using the data matrix $\hat{X}_r$ to achieve the following optimality guarantee

$$p_3^* \le p_r \le p_3^* \left( 1 + \frac{\sqrt{m_1 m_2}\, R\, \sigma_{r+1}}{\beta} \right)^2, \tag{10}$$

where $p_r$ denotes the objective value achieved by the parameters trained using $\hat{X}_r$.

Remark 2. Theorems 1 and 2 imply that, for an arbitrary rank data matrix $X$, the regularized training problem in (4) can be approximately solved by convex solvers to achieve the worst-case approximation $p_3^* (1 + \sqrt{m_1 m_2} R \sigma_{r+1}/\beta)^2$ with complexity $\mathcal{O}(d^3 m_1^3 2^{3(m_1+1)} n^{3(m_1+1)r})$, where $r \ll \min\{n, d\}$. Therefore, even for full rank data matrices, where the complexity is exponential in $n$ or $d$, one can approximately solve the convex program in (8) in polynomial time. Moreover, we remark that the approximation error proved in Theorem 2 can be arbitrarily small for practically relevant problems. As an example, consider a parallel network training problem with $\ell_2$ loss; then the upper bound becomes $(1 + \sqrt{m_1 m_2}\, \sigma_{r+1}/\beta)^2$, which is typically close to one due to fast decaying singular values in practice (see Figure 4).
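The rank-$r$ approximation and the resulting worst-case factor can be computed directly: the truncated SVD gives the best rank-$r$ approximation, whose spectral error is exactly $\sigma_{r+1}$ (a small sketch; the dimensions and the Lipschitz constant $R$ are toy values chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, r = 15, 10, 5
X = rng.standard_normal((n, d))

# Best rank-r approximation via truncated SVD; ||X - X_r||_2 = sigma_{r+1}.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Xr = (U[:, :r] * s[:r]) @ Vt[:r]
spectral_err = np.linalg.norm(X - Xr, 2)   # equals s[r], i.e., sigma_{r+1}

# Worst-case factor from Theorem 2 (R: Lipschitz constant of the loss).
m1, m2, R, beta = 1, 1, 1.0, 0.1
factor = (1 + np.sqrt(m1 * m2) * R * s[r] / beta) ** 2
```

The factor shrinks toward one as $\sigma_{r+1}$ decays, matching the discussion in Remark 2.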

3.2. REPRESENTATIONAL POWER: TWO VERSUS THREE LAYERS

Here, we provide a complete explanation of the representational power of three-layer networks by comparing with the two-layer results in Pilanci & Ergen (2020). We first note that three-layer networks have substantially higher expressive power due to the non-convex interactions between hidden layers, as detailed in Allen-Zhu et al. (2019); Pham & Nguyen (2021). Furthermore, Belilovsky et al. (2019) show that layerwise training of three-layer networks can achieve comparable performance to deeper models, e.g., VGG-11, on ImageNet. There exist several studies analyzing two-layer networks; however, despite their empirical success, a full theoretical understanding and interpretation of three-layer networks is still lacking in the literature. In this work, we provide a complete characterization of three-layer networks through the lens of convex optimization theory.

Table 2: Training objective of a three-layer parallel network trained with non-convex SGD (5 independent initialization trials, Run #1 to Run #5) versus our convex program on a toy dataset with $(n, d, m_1, m_2, \beta, \text{batch size}) = (5, 2, 3, 1, 0.002, 5)$, where the convex program in (8) is solved via the interior point solvers in CVX/CVXPY.

To understand their expressive power, we compare our convex program for three-layer networks in (8) with its two-layer counterpart in Pilanci & Ergen (2020). Pilanci & Ergen (2020) analyze two-layer networks with one ReLU layer, so that the data matrix $X$ is multiplied with a single diagonal matrix (or hyperplane arrangement) $D_i$. Thus, the effective data matrix is of the form $\bar{X}_s = [D_1 X \ldots D_P X]$. However, since our convex program in (8) has two nonlinear ReLU layers, the composition of these two layers can generate substantially more complex features via the locally linear variables $\{w_{ij_1l}^s\}$ multiplying the $d$-dimensional blocks of the columns of the effective data matrix $\bar{X}_s$ in Theorem 1. Although this may seem similar to the features in Ergen & Pilanci (2021c), here we have $2^{m_1}$ variables for each linear region, unlike Ergen & Pilanci (2021c), which employs 2 variables per linear region. Moreover, Ergen & Pilanci (2021c) only considers the case where the second hidden layer has a single neuron, i.e., $m_2 = 1$, and therefore does not cover standard three-layer or deeper networks. Hence, we exactly describe the impact of having one more ReLU layer and its contribution to the representational power of the network.
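The two-layer effective data matrix $\bar{X}_s = [D_1 X \ldots D_P X]$ discussed above can be formed by sampling hyperplane arrangements, i.e., distinct ReLU sign patterns of $Xw$ over random directions $w$ (a minimal sketch with assumed toy dimensions, following the sampling idea attributed to Pilanci & Ergen (2020) later in the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, n_draws = 6, 2, 200
X = rng.standard_normal((n, d))

# Sample candidate hyperplanes and keep the distinct sign patterns
# (hyperplane arrangements) they induce on the n data points.
patterns = {tuple((X @ rng.standard_normal(d) >= 0).astype(int))
            for _ in range(n_draws)}
Ds = [np.diag(np.array(p, dtype=float)) for p in sorted(patterns)]

# Two-layer effective data matrix: one d-column block per arrangement.
Xs = np.hstack([D @ X for D in Ds])
```

Each block $D_i X$ is a locally linear copy of the data, which is what the convex variables multiply; the three-layer program composes two such masking layers instead of one.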

4. EXPERIMENTS

In this section, we present numerical experiments corroborating our theory.

Low rank model in Theorem 2: To validate our claims, we generate a synthetic dataset as follows. We first randomly generate a set of layer weights for a parallel ReLU network with $K = 5$ sub-networks by sampling from a standard normal distribution. We then obtain the labels as $y = \sum_k ((XW_{1k})_+ W_{2k})_+ w_{3k} + 0.1\epsilon$, where $\epsilon \sim \mathcal{N}(0, I_n)$. To promote a low rank structure in the data, we first sample a matrix from the standard normal distribution and then set $\sigma_{r+1} = \ldots = \sigma_d = 1$. We consider a regression framework with $\ell_2$ loss and $(n, r, \beta, m_1, m_2) = (15, d/2, 0.1, 1, 1)$ and present the numerical results in Figure 4. Here, we observe that the objective $p_r$ achieved by the low rank approximation is closer to $p_3^*$ than the worst-case upper bound predicted by Theorem 2. Moreover, as shown in Figure 4b, the low rank approximation provides a significant reduction in the number of hyperplane arrangements, and therefore in the complexity of solving the convex program.

Toy dataset: We use a toy dataset with 5 samples and 2 features, i.e., $(n, d) = (5, 2)$. To generate the dataset, we forward propagate i.i.d. samples from a standard normal distribution, i.e., $x_i \sim \mathcal{N}(0, I_d)$, through a parallel network with 3 layers, 5 sub-networks, and 3 neurons, i.e., $(L, K, m_1, m_2) = (3, 5, 3, 1)$. We then train the parallel network in (4) on this toy dataset using both our convex program (8) and non-convex SGD. We provide the training objective and wall-clock time in Table 2, where we include 5 initialization trials for SGD. This experiment shows that when the number of sub-networks $K$ is small, SGD trials fail to converge to the global minimum achieved by our convex program. However, as we increase $K$, the number of trials converging to the global minimum gradually increases, demonstrating the benign impact of overparameterization.

Image classification:

We conduct experiments on benchmark image datasets, namely CIFAR-10 (Krizhevsky et al., 2014) and Fashion-MNIST (Xiao et al., 2017). We particularly consider a ten-class classification task and use a parallel network with 40 sub-networks and 100 hidden neurons, i.e., $(K, m_1, m_2) = (40, 100, 1)$. In Figure 5, we plot the test accuracies against wall-clock time, where we include several different optimizers as well as SGD. Moreover, we include a parallel network trained with SGD/Adam and Weight Decay (WD) regularization to show the effectiveness of the path regularization in (4). We first note that our convex approach achieves both faster convergence and higher final test accuracies for both datasets. However, the performance gain for Fashion-MNIST is significantly smaller than for the CIFAR-10 experiment. This is due to the nature of these datasets: since CIFAR-10 is a much more challenging dataset, the baseline accuracies are quite low (around $\sim 50\%$), unlike Fashion-MNIST, whose baseline accuracies are around $\sim 90\%$. Therefore, the accuracy improvement achieved by the convex program appears small in Figure 5b. We also observe that weight decay achieves faster convergence; however, path regularization yields higher final test accuracies. Faster convergence with weight decay is expected since it can be incorporated into gradient-based updates without any computational overhead.

5. RELATED WORK

Parallel neural networks were previously investigated by Zhang et al. (2019); Haeffele & Vidal (2017). Although these studies provided insights into the solutions, they require assumptions, e.g., sparsity among sub-networks in Theorem 1 of Haeffele & Vidal (2017), and linear activation and hinge loss assumptions in Zhang et al. (2019), which invalidate applications in practice. Recently, Pilanci & Ergen (2020) studied weight decay regularized two-layer ReLU network training problems and introduced polynomial-time trainable convex formulations. However, their analysis is restricted to standard two-layer ReLU networks, i.e., of the form $f_\theta(X) = (XW_1)_+ w_2$. The reason for this restriction is that handling more than one ReLU layer is a substantially more challenging optimization problem. As an example, a direct extension of Pilanci & Ergen (2020) to three-layer NNs would yield doubly exponential complexity, i.e., $\mathcal{O}(n^{rn^r})$ for a rank-$r$ data matrix, due to the combinatorial behavior of multiple ReLU layers. Thus, they only examined the case with a single ReLU layer (see Table 1 for details and the other relevant references in Pilanci & Ergen (2020)). In addition, since they only considered standard two-layer ReLU networks, their analysis is not valid for the broader range of NN-based architectures detailed in Remark 1. Later on, Ergen & Pilanci (2021c) extended this approach to three-layer ReLU networks. However, since they analyzed the $\ell_2$-norm regularized training problem, they had to put unit Frobenius norm constraints on the first layer weights, which does not reflect the settings in practice. In addition, their analysis is restricted to networks with a single neuron in the second layer (i.e., $m_2 = 1$), i.e., of the form $f_\theta(X) = \sum_k ((XW_{1k})_+ w_{2k})_+ w_{3k}$. Since this architecture only allows a single neuron in the second layer, each sub-network $k$ has an expressive power equivalent to a standard two-layer network rather than a three-layer one.
This can also be realized from the definition of the constraint set $\mathcal{C}$ in Theorem 1. Specifically, the convex set $\mathcal{C}$ in Ergen & Pilanci (2021c) has decoupled constraints across the hidden layer index $j_1$, whereas our formulation sums the responses over hidden neurons before feeding them through the next layer, as standard deep networks do. Therefore, their analysis does not reflect the true power of deep networks with $L > 2$. Moreover, the approach in Ergen & Pilanci (2021c) has exponential complexity when the data matrix has full rank, which is unavoidable. In contrast, we analyze the deep neural networks in (1) without any assumption on the weights. Furthermore, we develop an approximate training algorithm with fully polynomial-time complexity in the data dimensions and prove strong global optimality guarantees for this algorithm in Theorem 2.

6. CONCLUDING REMARKS

We studied the training problem of path regularized deep parallel ReLU networks, which include ResNets and standard deep ReLU networks as special cases. We first showed that the non-convex training problem can be equivalently cast as a single convex optimization problem. Therefore, we achieve the following advantages over training the original non-convex formulation: 1) since our model is convex, it can be globally optimized via standard convex solvers, whereas the non-convex formulation trained with optimizers such as SGD might get stuck at a local minimum; 2) thanks to convexity, our model does not require heuristics and additional tricks such as learning rate schedules, initialization scheme selection, or dropout. More importantly, we proposed an approximation to the convex program to enable fully polynomial-time training in terms of the number of data samples $n$ and feature dimension $d$. Thus, we proved the polynomial-time trainability of deep ReLU networks without requiring impractical assumptions, unlike Pilanci & Ergen (2020); Ergen & Pilanci (2021c). Notice that we derive an exact convex program only for three-layer networks; however, Wang et al. (2021) recently proved that strong duality holds for arbitrarily deep parallel networks. Therefore, a similar analysis can be extended to deeper networks to achieve an equivalent convex program, which is quite promising for future work. Additionally, although we analyzed fully connected networks in this paper, our approach can be directly extended to various NN architectures, e.g., convolutional networks (Ergen & Pilanci, 2021a), generative adversarial networks (Sahiner et al., 2021a), NNs with batch normalization (Ergen et al., 2021), and autoregressive models (Gupta et al., 2021).

Table 3: Test accuracies for the UCI experiments ($(m_1, m_2, K, \beta) = (100, 1, 40, 0.5)$ and 80%-20% training-test split).
Here, we present the standard non-convex architectures and the proposed convex counterpart trained with SGD and Adam optimizers. If one approach achieves higher accuracy on a certain dataset, we display the corresponding accuracy value in bold font. We observe that our convex approach achieves either higher or the same accuracy on 20 and 19 datasets (out of 21) when trained with SGD and Adam, respectively.

Details for the experiments in Table 2 and Figure 5: We first note that for the experiments in Table 2, we use CVX/CVXPY (Grant & Boyd, 2014; Diamond & Boyd, 2016; Agrawal et al., 2018) to globally solve the proposed convex program in (8). For these experiments, we use a laptop with an i7 processor and 16GB of RAM. In order to tune the learning rate of SGD/GD, we first perform training with several different learning rates and select the one with the best performance on the validation datasets, which is 0.005 in these experiments. For the comparatively larger scale image classification experiments in Figure 5, we utilize a cluster GPU with 50GB of memory. However, since the equivalent convex program in (8) has constraints which are challenging to handle for these datasets, we propose the following unconstrained convex problem, which has the same global minima as the constrained version, as discussed in Gupta et al. (2021):

$$\min_{z, z' \in \mathbb{R}^{dm_1MP_1P_2}} \mathcal{L}\big( \bar{X}(z - z'), y \big) + \beta \left( \|z\|_{G,1} + \|z'\|_{G,1} \right) + \lambda \left( h_{\mathcal{C}}(z) + h_{\mathcal{C}}(z') \right),$$

where $\lambda > 0$ is a coefficient to penalize the violated constraints and $h_{\mathcal{C}}(z)$ is a function summing the absolute values of all constraint violations, defined as

$$h_{\mathcal{C}}(z) := 1^T \sum_{i, j_1, l, s} \big( -(2D_{1ij_1} - I_n) X z_{ij_1l}^s \big)_+ + 1^T \sum_{i, l, s} \Big( -\sum_{j_1} I_{ij_1l}^s (2D_{2l} - I_n) D_{1ij_1} X z_{ij_1l}^s \Big)_+.$$

Thus, we obtain an unconstrained version of the convex optimization problem (11), where one can use commonly employed first-order gradient based optimizers, e.g., SGD and Adam, available in deep learning libraries such as PyTorch and TensorFlow.
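One term of the penalty $h_{\mathcal{C}}$ above is simply a hinge on the constraint residual: negative entries of $(2D - I_n)Xz$ are violations and contribute their absolute value (a minimal sketch for a single constraint block, with assumed toy dimensions):

```python
import numpy as np

def violation_penalty(X, D, z):
    """Sum of absolute violations of the constraint (2D - I_n) X z >= 0.

    This is one term of h_C(z): negative residual entries are violations,
    so we accumulate (-(2D - I_n) X z)_+ .
    """
    n = X.shape[0]
    residual = (2 * D - np.eye(n)) @ X @ z
    return np.sum(np.maximum(-residual, 0))

rng = np.random.default_rng(5)
n, d = 5, 3
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
D = np.diag((X @ w >= 0).astype(float))   # D matches w's own sign pattern

feasible_pen = violation_penalty(X, D, w)     # zero: w satisfies its pattern
infeasible_pen = violation_penalty(X, D, -w)  # positive: -w violates it
```

Because the penalty is a sum of ReLU terms, it is convex and subdifferentiable, so SGD/Adam can minimize the penalized objective directly.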
For both CIFAR-10 and Fashion-MNIST, we use the same training and test splits as in the original datasets. We again perform a grid search to tune the learning rate, where the best performance is achieved by $(\mu_{\text{Convex}}, \mu_{\text{SGD}}, \mu_{\text{Adam}}, \mu_{\text{Adagrad}}, \mu_{\text{Adadelta}}, \mu^{WD}_{\text{SGD}}) = (5\mathrm{e}{-7}, 5\mathrm{e}{-3}, 2\mathrm{e}{-5}, 2\mathrm{e}{-3}, 3\mathrm{e}{-1}, 1)$ for CIFAR-10 and $(\mu_{\text{Convex}}, \mu_{\text{SGD}}, \mu_{\text{Adam}}, \mu_{\text{Adagrad}}, \mu_{\text{Adadelta}}, \mu^{WD}_{\text{SGD}}) = (1\mathrm{e}{-5}, 2\mathrm{e}{-1}, 2\mathrm{e}{-3}, 1\mathrm{e}{-2}, 3, 1)$ for Fashion-MNIST. We choose the momentum coefficient of the SGD optimizer as 0.9. In addition, we set $P_1P_2 = K$ and $\lambda = 1\mathrm{e}{-5}$. More importantly, we remark that these experiments are performed using a small sampled subset of hyperplane arrangements rather than enumerating all possible arrangements, as detailed in Remark 3.3 of Pilanci & Ergen (2020). In particular, we first generate random weight matrices from a multivariate standard normal distribution and then solve the convex program using only the arrangements of the sampled weight matrices.

Details for the experiments in Figure 6: Layerwise training with shallow neural networks has been proven to work remarkably well. Particularly, Belilovsky et al. (2019) show that one can train three-layer neural networks sequentially to build deep networks that outperform end-to-end training with SOTA architectures. In Figure 6, we apply this layerwise training procedure with our convex training approach. In particular, each stage in this figure is our three-layer convex formulation in Theorem 1. Here, we use the same experimental setting as in the previous section. We observe that making the network deeper by stacking convex layers results in significant performance improvements. Specifically, at the fifth stage, we achieve almost 85% accuracy on CIFAR-10, unlike the below-60% accuracies in Figure 5.

A.2 PARALLEL RELU NETWORKS

The parallel network f_θ(X) models a wide range of NNs in practice. As an example, standard NNs and ResNets (He et al., 2016) are special cases of this architecture. To illustrate this, let us consider a parallel ReLU network with two sub-networks and four layers, i.e., K = 2 and L = 4. If we set W_lk = W_l, ∀k ∈ [2], l ∈ [4], then our architecture reduces to a standard four-layer network (up to the factor K = 2, which can be absorbed into w_4 by positive homogeneity)

Σ_{k=1}^{2} (((X W_{1k})_+ W_{2k})_+ W_{3k})_+ w_{4k} = 2 (((X W_1)_+ W_2)_+ W_3)_+ w_4,

where W_1 ∈ R^{d×m_1}, W_2 ∈ R^{m_1×m_2}, W_3 ∈ R^{m_2×m_3}, w_4 ∈ R^{m_3}. For ResNets, we first remark that since residual blocks are usually placed after a ReLU activation, which is positively homogeneous of degree one, each residual block in practice takes only nonnegative entries as its inputs. Thus, we can assume X ∈ R^{n×d}_+ without loss of generality. We also assume that the weights obey the following form: W_{11} = W_1, W_{21} = W_2, W_{12} = W_{22} = I_d, W_{31} = W_{32} = W_3, and w_{41} = w_{42} = w_4. Then

f_θ(X) = Σ_{k=1}^{2} (((X W_{1k})_+ W_{2k})_+ W_{3k})_+ w_{4k} = (((X W_1)_+ W_2)_+ W_3)_+ w_4 + (X W_3)_+ w_4,

which is a shallow ResNet as demonstrated in Figure 1 of Veit et al. (2016).
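The reduction to a standard network can be checked numerically. The sketch below (three layers instead of four for brevity; all shapes are hypothetical toy values) builds a parallel ReLU network and verifies that tying the sub-network weights recovers a standard network scaled by K:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def parallel_net(X, weights):
    """Parallel ReLU network; weights is a list over sub-networks,
    each entry a list [W1, W2, w3]."""
    out = 0.0
    for W1, W2, w3 in weights:
        out = out + relu(relu(X @ W1) @ W2) @ w3
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W1 = rng.standard_normal((3, 5))
W2 = rng.standard_normal((5, 2))
w3 = rng.standard_normal(2)

# K = 2 identical sub-networks reduce to one standard network times K
y_par = parallel_net(X, [[W1, W2, w3], [W1, W2, w3]])
y_std = 2 * relu(relu(X @ W1) @ W2) @ w3
```

The same forward function covers the ResNet case by setting some of the inner weight matrices to the identity, as described above.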

A.3 PROOF OF LEMMA 1

Let us first define

r_{k j_{L-1}} := ( Σ_{j_1,…,j_{L-2}} ‖w_{1kj_1}‖_2^2 Π_{l=2}^{L-1} w_{lkj_{l-1}j_l}^2 )^{1/2} > 0.

Notice that if r_{kj_{L-1}} = 0, the paths through neuron j_{L-1} of the k-th sub-network do not contribute to the output of the parallel network in (1). Therefore, we can remove the paths with r_{kj_{L-1}} = 0 without loss of generality. Now, we apply the following change of variables

W̃_{lk} = W_{lk}, ∀l ∈ [L-2],   w̃_{(L-1)kj_{L-1}} = w_{(L-1)kj_{L-1}} / r_{kj_{L-1}},   w̃_{Lkj_{L-1}} = r_{kj_{L-1}} w_{Lkj_{L-1}}, ∀j_{L-1} ∈ [m_{L-1}].

We now note that

Σ_{j_1,…,j_{L-2}} ‖w̃_{1kj_1}‖_2^2 Π_{l=2}^{L-1} w̃_{lkj_{l-1}j_l}^2 = (1 / r_{kj_{L-1}}^2) Σ_{j_1,…,j_{L-2}} ‖w_{1kj_1}‖_2^2 Π_{l=2}^{L-1} w_{lkj_{l-1}j_l}^2 = 1.

Then, (2) can be restated as follows

p*_L = min_{{{W_{lk}}_{l=1}^L}_{k=1}^K} L( Σ_{k=1}^K ((X W_{1k})_+ … W_{(L-1)k})_+ w_{Lk}, y ) + β Σ_{k=1}^K ( Σ_{j_1,…,j_{L-1}} ‖w_{1kj_1}‖_2^2 Π_{l=2}^{L} w_{lkj_{l-1}j_l}^2 )^{1/2}
= min L( Σ_{k=1}^K Σ_{j_{L-1}} ((X W_{1k})_+ … w_{(L-1)kj_{L-1}})_+ w_{Lkj_{L-1}}, y ) + β Σ_{k=1}^K ( Σ_{j_{L-1}} w_{Lkj_{L-1}}^2 r_{kj_{L-1}}^2 )^{1/2}
= min L( Σ_{k=1}^K Σ_{j_{L-1}} ((X W_{1k})_+ … r_{kj_{L-1}}^{-1} w_{(L-1)kj_{L-1}})_+ r_{kj_{L-1}} w_{Lkj_{L-1}}, y ) + β Σ_{k=1}^K ‖w̃_{Lk}‖_2
= min_{(W_{1k},…,W_{(L-1)k}) ∈ Θ_s, ∀k} L( Σ_{k=1}^K ((X W_{1k})_+ … W_{(L-1)k})_+ w_{Lk}, y ) + β Σ_{k=1}^K ‖w_{Lk}‖_2,

where the third equality uses the positive homogeneity of the ReLU activation, and

Θ_s := { (W_1, …, W_{L-1}) : Σ_{j_1,…,j_{L-2}} ‖w_{1j_1}‖_2^2 Π_{l=2}^{L-1} w_{lj_{l-1}j_l}^2 = 1, ∀j_{L-1} ∈ [m_{L-1}] }.

We also note that one can relax the equality constraint to an inequality constraint without loss of generality. This is due to the fact that if a constraint is not tight at the optimum, i.e., strictly less than one, then we can scale the corresponding inner weights up and the corresponding output layer weight down, which strictly reduces the objective value. This contradicts the optimality of the initial set of layer weights (those yielding a strict inequality in the constraints), so the constraints must be tight at the optimum.
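For L = 3, the rescaling used in this proof can be verified numerically. The following sketch (toy shapes, one sub-network) checks that the network output is unchanged and that the path regularizer collapses to the ℓ2 norm of the rescaled output layer:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
d, m1, m2, n = 3, 4, 2, 5
X = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, m1))
W2 = rng.standard_normal((m1, m2))
w3 = rng.standard_normal(m2)

# r_{j2} = sqrt(sum_{j1} ||w_{1 j1}||^2 * W2[j1, j2]^2), positive a.s.
r = np.sqrt((np.sum(W1**2, axis=0)[:, None] * W2**2).sum(axis=0))
W2_t = W2 / r            # rescale layer L-1 (columnwise)
w3_t = w3 * r            # compensate in the output layer

out_before = relu(relu(X @ W1) @ W2) @ w3
out_after = relu(relu(X @ W1) @ W2_t) @ w3_t

# path regularizer sqrt(sum_{j1,j2} ||w_{1 j1}||^2 W2^2 w3^2)
path_reg = np.sqrt((np.sum(W1**2, axis=0)[:, None] * W2**2 * w3**2).sum())
```

By positive homogeneity of the ReLU, dividing a column of W2 by r_{j2} > 0 and multiplying w3_{j2} by the same factor leaves the output intact, while the path regularizer becomes exactly ‖w̃_3‖_2.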

A.4 PROOF OF THEOREM 1

To obtain the bidual problem of (4), we first utilize semi-infinite duality theory as follows. We first compute the dual of (6) with respect to the dual variable v to get

p*_∞ := min_μ L( ∫_{θ ∈ Θ_s, ‖w_3‖_2 ≤ 1} ((X W_1)_+ W_2)_+ w_3 dμ(θ), y ) + β ‖μ‖_TV,   (12)

where ‖μ‖_TV denotes the total variation norm of the signed measure μ. Notice that (12) is an infinite-dimensional neural network training problem similar to the one studied in Bach (2017). More importantly, this problem is convex since the model is linear with respect to the measure μ and the loss and regularization functions are convex (Bach, 2017). Thus, we have no duality gap, i.e., d*_3 = p*_∞. In addition, even though (12) is an infinite-dimensional convex optimization problem, it reduces to a problem with at most n + 1 neurons at the optimum due to Caratheodory's theorem (Rosset et al., 2007). Therefore, (12) can be equivalently stated as the following finite-dimensional convex optimization problem

p*_∞ = min_{θ∈Θ_s} L( Σ_{k=1}^{K*} ((X W_{1k})_+ W_{2k})_+ w_{3k}, y ) + β Σ_{k=1}^{K*} ‖w_{3k}‖_2 = min_{θ∈Θ_s} L(f_θ(X), y) + β Σ_{k=1}^{K*} ‖w_{3k}‖_2,   (13)

where K* ≤ n + 1. We further remark that given K ≥ K*, (13) and (5) are the same problems, which also proves strong duality as p*_3 = p*_∞ = d*_3. In the remainder of the proof, we show that, using an alternative representation for the ReLU activation, we can achieve a finite-dimensional convex bidual formulation. We now restate the dual problem (6) as

d*_3 = max_v -L*(v) s.t. max_{θ∈Θ_s} ‖v^T ((X W_1)_+ W_2)_+‖_2 ≤ β.   (14)
We first note that using the representation in (7), the dual constraint in (14) can be written as

max_{θ∈Θ_s} ‖v^T ((X W_1)_+ W_2)_+‖_2 ≤ β
⟺ max_{θ∈Θ_s} ( Σ_{j_2=1}^{m_2} ( v^T ((X W_1)_+ w_{2j_2})_+ )^2 )^{1/2} ≤ β
⟺ max_{θ∈Θ_s} √m_2 | v^T ((X W_1)_+ w_2)_+ | ≤ β
⟺ max_{I_{j_1}∈{±1}} max_{θ∈Θ_s} √m_2 | v^T ( Σ_{j_1=1}^{m_1} I_{j_1} (X w_{1j_1} |w_{2j_1}|)_+ )_+ | ≤ β
⟺ max_{i∈[P_1], l∈[P_2]} max_{I_{j_1}∈{±1}} max_{{w_{j_1}}_{j_1} ∈ C_{il}} √m_2 v^T D_{2l} Σ_{j_1=1}^{m_1} I_{j_1} D_{1ij_1} X w_{j_1} ≤ β,

where Θ_s = {(W_1, W_2) : Σ_{j_1=1}^{m_1} ‖w_{1j_1}‖_2^2 w_{2j_1j_2}^2 ≤ 1, ∀j_2 ∈ [m_2]}, we apply the change of variables w_{j_1} = |w_{2j_1}| w_{1j_1}, and we define the set C_{il} as

C_{il} := { {w_{j_1}}_{j_1} : (2D_{2l} - I_n) Σ_{j_1=1}^{m_1} I_{j_1} D_{1ij_1} X w_{j_1} ≥ 0, (2D_{1ij_1} - I_n) X w_{j_1} ≥ 0, ∀j_1 ∈ [m_1], Σ_{j_1=1}^{m_1} ‖w_{j_1}‖_2^2 ≤ 1 }.

We also note that P_1 and P_2 denote the numbers of possible hyperplane arrangements for the first and second ReLU layers, respectively (see Appendix A.10 for details).

Then we have

max_{θ∈Θ_s} ‖v^T ((X W_1)_+ W_2)_+‖_2 ≤ β
⟺ max_{i∈[P_1], l∈[P_2]} max_{I_{j_1}∈{±1}} max_{{w_{j_1}}_{j_1} ∈ C_{il}} √m_2 v^T D_{2l} Σ_{j_1=1}^{m_1} I_{j_1} D_{1ij_1} X w_{j_1} ≤ β
⟺ max_{{w^s_{ij_1l}}_{j_1} ∈ C^s_{il}} √m_2 v^T D_{2l} Σ_{j_1=1}^{m_1} I^s_{ij_1l} D_{1ij_1} X w^s_{ij_1l} ≤ β and max_{{w'^s_{ij_1l}}_{j_1} ∈ C^s_{il}} -√m_2 v^T D_{2l} Σ_{j_1=1}^{m_1} I^s_{ij_1l} D_{1ij_1} X w'^s_{ij_1l} ≤ β, ∀i ∈ [P_1], ∀l ∈ [P_2], ∀s ∈ [M],

where we use M := |{±1}^{m_1}| = 2^{m_1} to enumerate all possible sign patterns {I_{j_1}}_{j_1=1}^{m_1} of size m_1. Using the equivalent representation above, we rewrite the dual problem (14) as

max_v -L*(v)
s.t. max_{{w^s_{ij_1l}}_{j_1} ∈ C^s_{il}} √m_2 v^T D_{2l} Σ_{j_1=1}^{m_1} I^s_{ij_1l} D_{1ij_1} X w^s_{ij_1l} ≤ β, ∀i, l, s
     max_{{w'^s_{ij_1l}}_{j_1} ∈ C^s_{il}} -√m_2 v^T D_{2l} Σ_{j_1=1}^{m_1} I^s_{ij_1l} D_{1ij_1} X w'^s_{ij_1l} ≤ β, ∀i, l, s.   (16)

Since the problem above is convex and satisfies Slater's condition when all the parameters are set to zero, we have strong duality (Boyd & Vandenberghe, 2004), and thus we can state (16) as

min_{γ^s_{il} ≥ 0, γ'^s_{il} ≥ 0} max_v min_{{w^s_{ij_1l}}_{j_1} ∈ C^s_{il}, {w'^s_{ij_1l}}_{j_1} ∈ C^s_{il}} -L*(v) + Σ_{s=1}^{M} Σ_{i=1}^{P_1} Σ_{l=1}^{P_2} γ^s_{il} ( β - √m_2 v^T D_{2l} Σ_{j_1=1}^{m_1} I^s_{ij_1l} D_{1ij_1} X w^s_{ij_1l} ) + Σ_{s=1}^{M} Σ_{i=1}^{P_1} Σ_{l=1}^{P_2} γ'^s_{il} ( β + √m_2 v^T D_{2l} Σ_{j_1=1}^{m_1} I^s_{ij_1l} D_{1ij_1} X w'^s_{ij_1l} ).

Due to Sion's minimax theorem (Sion, 1958), we can change the order of the inner minimization and maximization to obtain a closed-form solution for the maximization over the variable v. This yields the following problem

min_{γ^s_{il}, γ'^s_{il} ≥ 0} min_{{w^s_{ij_1l}}_{j_1} ∈ C^s_{il}, {w'^s_{ij_1l}}_{j_1} ∈ C^s_{il}} L( Σ_{s=1}^{M} Σ_{l=1}^{P_2} Σ_{i=1}^{P_1} Σ_{j_1=1}^{m_1} √m_2 D_{2l} D_{1ij_1} X ( I^s_{ij_1l} γ^s_{il} w^s_{ij_1l} - I^s_{ij_1l} γ'^s_{il} w'^s_{ij_1l} ), y ) + β Σ_{s=1}^{M} Σ_{i=1}^{P_1} Σ_{l=1}^{P_2} ( γ^s_{il} + γ'^s_{il} ).   (20)
Finally, we introduce a set of variable changes z^s_{ij_1l} = √m_2 γ^s_{il} I^s_{ij_1l} w^s_{ij_1l} and z'^s_{ij_1l} = √m_2 γ'^s_{il} I^s_{ij_1l} w'^s_{ij_1l} such that (20) can be cast as the following convex problem

min_{{z^s_{ij_1l}}_{j_1} ∈ C̃^s_{il}, {z'^s_{ij_1l}}_{j_1} ∈ C̃^s_{il}} L( Σ_{s=1}^{M} Σ_{l=1}^{P_2} Σ_{i=1}^{P_1} Σ_{j_1=1}^{m_1} D_{2l} D_{1ij_1} X (z^s_{ij_1l} - z'^s_{ij_1l}), y ) + (β/√m_2) Σ_{s=1}^{M} Σ_{i=1}^{P_1} Σ_{l=1}^{P_2} ( ( Σ_{j_1=1}^{m_1} ‖z^s_{ij_1l}‖_2^2 )^{1/2} + ( Σ_{j_1=1}^{m_1} ‖z'^s_{ij_1l}‖_2^2 )^{1/2} ),   (21)

where the constraint sets C̃^s_{il} are defined as

C̃^s_{il} := { {z_{j_1}}_{j_1} : (2D_{2l} - I_n) Σ_{j_1=1}^{m_1} D_{1ij_1} X z_{j_1} ≥ 0, (2D_{1ij_1} - I_n) X I^s_{ij_1l} z_{j_1} ≥ 0, ∀j_1 ∈ [m_1] }.

Notice that (21) is a constrained convex optimization problem with 2dm_1MP_1P_2 variables and 2n(m_1+1)MP_1P_2 constraints in the sets C̃^s_{il}.

A.5 PROOF OF PROPOSITION 1

In this section, we prove that once the convex program in (21) is globally optimized to obtain a set of optimal solutions {z^{s*}_{ij_1l}, z'^{s*}_{ij_1l}}_{i,j_1,l,s}, one can recover an optimal solution to the non-convex training problem (4) via a simple closed-form mapping as detailed below:

W*_{1k} = (1/m_2) [ I^s_{i1l} z^{s*}_{i1l}  I^s_{i2l} z^{s*}_{i2l}  …  I^s_{im_1l} z^{s*}_{im_1l} ] if 1 ≤ k ≤ MP_1P_2, and (1/m_2) [ I^s_{i1l} z'^{s*}_{i1l}  …  I^s_{im_1l} z'^{s*}_{im_1l} ] if MP_1P_2 + 1 ≤ k ≤ 2MP_1P_2,
W*_{2k} with entries (W*_{2k})_{j_1j_2} = I^s_{ij_1l}, ∀j_1 ∈ [m_1], j_2 ∈ [m_2],
w*_{3k} = [1 1 … 1]^T if 1 ≤ k ≤ MP_1P_2, and [-1 -1 … -1]^T if MP_1P_2 + 1 ≤ k ≤ 2MP_1P_2,

where the index triple is given by

(s, l, i) = ( ⌊(k-1)/(P_1P_2)⌋ + 1, ⌊(k - 1 - (s-1)P_1P_2)/P_1⌋ + 1, k - (s-1)P_1P_2 - (l-1)P_1 ) if 1 ≤ k ≤ MP_1P_2,

and by the same formulas applied to k' := k - MP_1P_2 otherwise. Hence, we achieve an optimal solution to the original non-convex training problem (4) as {W*_{1k}, W*_{2k}, w*_{3k}}_{k=1}^{2MP_1P_2}, where W*_{1k} ∈ R^{d×m_1}, W*_{2k} ∈ R^{m_1×m_2}, and w*_{3k} ∈ R^{m_2}, respectively. Next, we confirm that the proposed set of layer weights is indeed optimal by plugging them back into both the convex and non-convex objectives. We first verify that the optimal convex and non-convex layer weights give the same network output:

Σ_{k=1}^{2MP_1P_2} ((X W*_{1k})_+ W*_{2k})_+ w*_{3k} = Σ_{s=1}^{M} Σ_{l=1}^{P_2} Σ_{i=1}^{P_1} Σ_{j_1=1}^{m_1} D_{2l} D_{1ij_1} X (z^{s*}_{ij_1l} - z'^{s*}_{ij_1l}).

Now, we show that the proposed set of weight matrices for the non-convex problem achieves the same regularization cost as (8), i.e.,

Σ_{k=1}^{2MP_1P_2} ( Σ_{j_1,j_2} ‖w*_{1kj_1}‖_2^2 w*^2_{2kj_1j_2} w*^2_{3kj_2} )^{1/2} = (1/√m_2) Σ_{s=1}^{M} Σ_{i=1}^{P_1} Σ_{l=1}^{P_2} ( ( Σ_{j_1=1}^{m_1} ‖z^{s*}_{ij_1l}‖_2^2 )^{1/2} + ( Σ_{j_1=1}^{m_1} ‖z'^{s*}_{ij_1l}‖_2^2 )^{1/2} ).

Since {W*_{1k}, W*_{2k}, w*_{3k}}_{k=1}^{2MP_1P_2} yields the same network output and regularization cost as the optimal parameters of the convex program in (21), we conclude that the proposed set of parameters for the non-convex problem also achieves the optimal objective value p*_3, i.e.,

p*_3 = L( Σ_{k=1}^{2MP_1P_2} ((X W*_{1k})_+ W*_{2k})_+ w*_{3k}, y ) + β Σ_{k=1}^{2MP_1P_2} ( Σ_{j_2=1}^{m_2} Σ_{j_1=1}^{m_1} ‖w*_{1kj_1}‖_2^2 w*^2_{2kj_1j_2} w*^2_{3kj_2} )^{1/2}.
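The output-matching part of the mapping can be sanity-checked numerically. The sketch below (our own toy construction, a single (s, i, l) block with all sign patterns equal to +1; the z-variables are built so that the arrangement constraints hold by construction) verifies that the recovered non-convex weights reproduce the convex model's output:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
n, d, m1, m2 = 5, 3, 4, 2
X = rng.standard_normal((n, d))

# z-variables satisfying the arrangement constraints by construction
z = [rng.standard_normal(d) for _ in range(m1)]
D1 = [np.diag((X @ zj >= 0).astype(float)) for zj in z]
pre2 = sum(D1[j] @ X @ z[j] for j in range(m1))
D2 = np.diag((pre2 >= 0).astype(float))

# closed-form mapping for one block (all I = +1)
W1 = np.stack(z, axis=1) / m2    # columns z_j / m2
W2 = np.ones((m1, m2))           # entries equal to the sign pattern
w3 = np.ones(m2)

out_nonconvex = relu(relu(X @ W1) @ W2) @ w3
out_convex = D2 @ pre2           # convex-model output for this block
```

The first ReLU acts as D_{1ij_1} on each column, the second as D_{2l} on the summed pre-activation, and the m_2 identical output columns cancel the 1/m_2 scaling, so the two outputs coincide.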
A.6 PROOF OF THEOREM 2

We start by defining the optimal parameters for the original and rank-r approximated versions of the rescaled problem in (5) as

{(W*_{1k}, W*_{2k}, w*_{3k})}_{k=1}^K := argmin_{θ∈Θ_s} L( Σ_{k=1}^K ((X W_{1k})_+ W_{2k})_+ w_{3k}, y ) + β Σ_{k=1}^K ‖w_{3k}‖_2
{(Ŵ_{1k}, Ŵ_{2k}, ŵ_{3k})}_{k=1}^K := argmin_{θ∈Θ_s} L( Σ_{k=1}^K ((X_r W_{1k})_+ W_{2k})_+ w_{3k}, y ) + β Σ_{k=1}^K ‖w_{3k}‖_2,   (22)

and the objective value achieved by the parameters trained using X_r as

p̂_r := L( Σ_{k=1}^K ((X Ŵ_{1k})_+ Ŵ_{2k})_+ ŵ_{3k}, y ) + β Σ_{k=1}^K ‖ŵ_{3k}‖_2.

Then, we have

p*_3 = L( Σ_{k=1}^K ((X W*_{1k})_+ W*_{2k})_+ w*_{3k}, y ) + β Σ_{k=1}^K ‖w*_{3k}‖_2
(i) ≤ L( Σ_{k=1}^K ((X Ŵ_{1k})_+ Ŵ_{2k})_+ ŵ_{3k}, y ) + β Σ_{k=1}^K ‖ŵ_{3k}‖_2 = p̂_r
(ii) ≤ L( Σ_{k=1}^K ((X_r Ŵ_{1k})_+ Ŵ_{2k})_+ ŵ_{3k}, y ) + ( β + √(m_1m_2) R σ_{r+1} ) Σ_{k=1}^K ‖ŵ_{3k}‖_2
≤ [ L( Σ_{k=1}^K ((X_r Ŵ_{1k})_+ Ŵ_{2k})_+ ŵ_{3k}, y ) + β Σ_{k=1}^K ‖ŵ_{3k}‖_2 ] ( 1 + √(m_1m_2) R σ_{r+1} / β )
(iii) ≤ [ L( Σ_{k=1}^K ((X_r W*_{1k})_+ W*_{2k})_+ w*_{3k}, y ) + β Σ_{k=1}^K ‖w*_{3k}‖_2 ] ( 1 + √(m_1m_2) R σ_{r+1} / β )
(iv) ≤ [ L( Σ_{k=1}^K ((X W*_{1k})_+ W*_{2k})_+ w*_{3k}, y ) + β Σ_{k=1}^K ‖w*_{3k}‖_2 ] ( 1 + √(m_1m_2) R σ_{r+1} / β )^2
= p*_3 ( 1 + √(m_1m_2) R σ_{r+1} / β )^2,

where (i) and (iii) follow from the optimality definitions of the original and approximated problems in (22).
In addition, (ii) and (iv) follow from the relations below:

L( Σ_{k=1}^K ((X Ŵ_{1k})_+ Ŵ_{2k})_+ ŵ_{3k}, y )
= L( Σ_{k=1}^K Σ_{j_2=1}^{m_2} [ ((X Ŵ_{1k})_+ ŵ_{2kj_2})_+ ŵ_{3kj_2} - ((X_r Ŵ_{1k})_+ ŵ_{2kj_2})_+ ŵ_{3kj_2} + ((X_r Ŵ_{1k})_+ ŵ_{2kj_2})_+ ŵ_{3kj_2} ], y )
(1) ≤ L( Σ_{k} Σ_{j_2} [ ((X Ŵ_{1k})_+ ŵ_{2kj_2})_+ ŵ_{3kj_2} - ((X_r Ŵ_{1k})_+ ŵ_{2kj_2})_+ ŵ_{3kj_2} ], y ) + L( Σ_{k} Σ_{j_2} ((X_r Ŵ_{1k})_+ ŵ_{2kj_2})_+ ŵ_{3kj_2}, y )
(2) ≤ R Σ_{k} Σ_{j_2} ‖ ((X Ŵ_{1k})_+ ŵ_{2kj_2})_+ ŵ_{3kj_2} - ((X_r Ŵ_{1k})_+ ŵ_{2kj_2})_+ ŵ_{3kj_2} ‖_2 + L( … )
= R Σ_{k} Σ_{j_2} ‖ ((X Ŵ_{1k})_+ ŵ_{2kj_2})_+ - ((X_r Ŵ_{1k})_+ ŵ_{2kj_2})_+ ‖_2 |ŵ_{3kj_2}| + L( … )
(3) ≤ R Σ_{k} Σ_{j_2} ‖ (X Ŵ_{1k})_+ ŵ_{2kj_2} - (X_r Ŵ_{1k})_+ ŵ_{2kj_2} ‖_2 |ŵ_{3kj_2}| + L( … )
≤ R max_{k∈[K], j_2∈[m_2]} ‖ (X Ŵ_{1k})_+ ŵ_{2kj_2} - (X_r Ŵ_{1k})_+ ŵ_{2kj_2} ‖_2 Σ_{k=1}^K ‖ŵ_{3k}‖_1 + L( … )
(4) ≤ R max_{k∈[K]} ‖X - X_r‖_2 Σ_{j_1=1}^{m_1} ‖ŵ_{1kj_1}‖_2 |ŵ_{2kj_1j_2}| Σ_{k=1}^K ‖ŵ_{3k}‖_1 + L( … )
(5) ≤ √m_1 R max_{k∈[K]} ‖X - X_r‖_2 ( Σ_{j_1=1}^{m_1} ‖ŵ_{1kj_1}‖_2^2 ŵ_{2kj_1j_2}^2 )^{1/2} Σ_{k=1}^K ‖ŵ_{3k}‖_1 + L( … )
(6) = √m_1 R σ_{r+1} Σ_{k=1}^K ‖ŵ_{3k}‖_1 + L( … )
≤ √(m_1m_2) R σ_{r+1} Σ_{k=1}^K ‖ŵ_{3k}‖_2 + L( Σ_{k=1}^K Σ_{j_2=1}^{m_2} ((X_r Ŵ_{1k})_+ ŵ_{2kj_2})_+ ŵ_{3kj_2}, y ),

where L( … ) abbreviates the last loss term above, and we use the convexity and R-Lipschitz property of the loss function, the convexity of the ℓ2 norm, the 1-Lipschitz property of the ReLU activation, ‖x‖_1 ≤ √m ‖x‖_2 for x ∈ R^m, and max_k Σ_{j_1=1}^{m_1} ‖ŵ_{1kj_1}‖_2^2 ŵ_{2kj_1j_2}^2 ≤ 1 from the rescaling in Lemma 1 for (1), (2), (3), (4), (5), and (6), respectively.

A.7 PROOF FOR THE DUAL PROBLEM IN (3)

In order to derive the dual problem, we directly utilize Fenchel duality (Boyd & Vandenberghe, 2004). Let us first rewrite the primal regularized training problem after the application of the rescaling in Lemma 1 as follows

p*_L = min_{ŷ∈R^n, θ∈Θ_s} L(ŷ, y) + β Σ_{k=1}^K ‖w_{Lk}‖_2 s.t. ŷ = Σ_{k=1}^K ((X W_{1k})_+ … W_{(L-1)k})_+ w_{Lk}.   (23)

The corresponding Lagrangian can be computed by incorporating the constraint into the objective via a dual variable v as follows

L(v, ŷ, w_{Lk}) = L(ŷ, y) - v^T ŷ + v^T Σ_{k=1}^K ((X W_{1k})_+ … W_{(L-1)k})_+ w_{Lk} + β Σ_{k=1}^K ‖w_{Lk}‖_2.

We then define the following dual function

g(v) = min_{ŷ, w_{Lk}} L(v, ŷ, w_{Lk}) = -L*(v) s.t. ‖v^T ((X W_{1k})_+ … W_{(L-1)k})_+‖_2 ≤ β, ∀k ∈ [K],

where L* is the Fenchel conjugate function defined as (Boyd & Vandenberghe, 2004)

L*(v) := max_z z^T v - L(z, y).

Hence, we write the dual of (23) as

p*_L = min_{θ∈Θ_s} max_v g(v) = min_{θ∈Θ_s} max_v -L*(v) s.t. ‖v^T ((X W_{1k})_+ … W_{(L-1)k})_+‖_2 ≤ β, ∀k ∈ [K].

However, since the hidden layer weights are variables of the outer minimization, we cannot directly characterize the optimal hidden layer weights in the form above. Thus, as the last step of the derivation, we change the order of the minimization over θ and the maximization over v to obtain the following lower bound

p*_L ≥ d*_L = max_v min_{θ∈Θ_s} -L*(v) s.t. ‖v^T ((X W_{1k})_+ … W_{(L-1)k})_+‖_2 ≤ β, ∀k ∈ [K]
= max_v -L*(v) s.t. max_{θ∈Θ_s} ‖v^T ((X W_1)_+ … W_{(L-1)})_+‖_2 ≤ β.

A.8 HYPERPLANE ARRANGEMENTS

Here, we review the notion of hyperplane arrangements detailed in Pilanci & Ergen (2020). We first define the set of all hyperplane arrangements for the data matrix X as H := {sign(Xw) : w ∈ R^d}, where |H| ≤ 2^n. The set H contains all possible {+1, -1} labelings of the data samples {x_i}_{i=1}^n via a linear classifier w ∈ R^d.
We now define a new set to denote the indices with positive signs for each element in the set H as S := {∪_{h_i=1}{i} : h ∈ H}. With this definition, we note that given an element S ∈ S, one can introduce a diagonal matrix D(S) ∈ R^{n×n} defined as

D(S)_{ii} := 1 if i ∈ S, 0 otherwise.

D(S) can also be viewed as a diagonal matrix of indicators, where each diagonal entry is one if the corresponding sample is labeled as +1 by the linear classifier w, and zero otherwise. Therefore, the output of the ReLU activation can be equivalently written as (Xw)_+ = D(S)Xw provided that D(S)Xw ≥ 0 and (I_n - D(S))Xw ≤ 0 are satisfied. One can state these two constraints more compactly as (2D(S) - I_n)Xw ≥ 0. We now denote the cardinality of S as P and obtain the following upper bound (Ojha, 2000; Stanley et al., 2004; Winder, 1966; Cover, 1965)

P ≤ 2 Σ_{k=0}^{r-1} C(n-1, k) ≤ 2r ( e(n-1)/r )^r,

where r := rank(X) ≤ min(n, d).

A.9 LOW RANK MODEL IN THEOREM 2

In Section 3.1, we propose an ε-approximate training approach that has polynomial-time complexity even when the data matrix is full rank. Here, the rank r can be selected by plugging the desired approximation error and network structure into equation 10. We show that the approximation error proved in Theorem 2 can be made arbitrarily small for practically relevant problems. As an example, consider a parallel architecture training problem with ℓ2 loss; then the upper bound becomes (1 + √(m_1m_2) σ_{r+1}/β)^2, which can be arbitrarily close to one due to the presence of a noise component (with small σ_{r+1}) in most datasets in practice (see Figure 4 for an empirical verification). This observation is also valid for several benchmark datasets, including MNIST, CIFAR-10, and CIFAR-100, which exhibit exponentially decaying singular values (see Figure 7) and therefore effectively have a low rank structure.
In addition, singular values can be computed to set the target rank and the value of the regularization coefficient to obtain any desired approximation ratio using Theorem 2.
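The arrangement-counting bound from Appendix A.8 and the linearized ReLU identity can both be checked on toy data; the sketch below (toy sizes, Monte Carlo enumeration of sign patterns) verifies P ≤ 2 Σ_{k≤r-1} C(n-1, k) and (Xw)_+ = D(S)Xw:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, d = 6, 2
X = rng.standard_normal((n, d))
r = np.linalg.matrix_rank(X)

# enumerate distinct sign patterns sign(Xw) over random directions w
patterns = set()
for _ in range(2000):
    w = rng.standard_normal(d)
    patterns.add(tuple((X @ w >= 0).astype(int)))
P_est = len(patterns)
bound = 2 * sum(comb(n - 1, k) for k in range(r))

# ReLU as a linear map under a fixed arrangement: (Xw)_+ = D X w
w = rng.standard_normal(d)
D = np.diag((X @ w >= 0).astype(float))
lhs = np.maximum(X @ w, 0)
rhs = D @ X @ w
```

Since random sampling can only discover arrangements that actually exist, the estimated count never exceeds the combinatorial bound, which is what makes the sampled convex programs above well-defined.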

A.10 PROOF OF PROPOSITION 2 AND COROLLARY 1

We first review the multi-layer hyperplane arrangements concept introduced in Section 2.1 of Ergen & Pilanci (2021c). Based on this concept, we then calculate the training complexity to globally solve the convex program in Theorem 1. If we denote the number of hyperplane arrangements for the first ReLU layer as P_1, then from Appendix A.8 we know that P_1 ≤ 2r (e(n-1)/r)^r ≈ O(n^r). In order to obtain a bound for the number of hyperplane arrangements in the second ReLU layer, we first note that the preactivations of the second ReLU layer, i.e., (XW_1)_+ w_2, can be equivalently represented in matrix-vector product form by using the effective data matrix X̃ := [I_1 D_{11} X  I_2 D_{12} X  …  I_{m_1} D_{1m_1} X], due to the equivalent representation in (7). Therefore, given rank(X̃) = r_2 ≤ m_1 r, we have P_2 ≤ 2r_2 (e(n-1)/r_2)^{r_2} for a fixed X̃. However, notice that X̃ is not a fixed data matrix, since we can choose each diagonal matrix D_{1j_1} from a set {D_i}_{i=1}^{P_1} of size P_1 due to (24) and each sign pattern I_{j_1} from the set {+1, -1} of size 2. Thus, in the worst case, we have the following upper bound

P_2 ≤ 2r_2 ( e(n-1)/r_2 )^{r_2} (2P_1)^{m_1},   (25)

which is polynomial with respect to the number of data samples n and the feature dimension d for fixed scalars m_1 and r; the same holds for P_1.

Remark 3. Notice that Convolutional Neural Networks (CNNs) operate on the patch matrices {X_b}_{b=1}^B instead of the full data matrix X, where X_b ∈ R^{n×h} and h denotes the filter size. Hence, even when the data matrix is full rank, i.e., r = min{n, d}, the number of hyperplane arrangements P_1 is upper bounded as P_1 ≤ O(n^{r_c}), where r_c := max_b rank(X_b) ≤ h ≤ min{n, d} (see Ergen & Pilanci (2021a) for details). For instance, for a CNN with m_1 = 512 filters of size 3 × 3, we have r_c ≤ 9 independent of the data dimensions n, d. As a consequence, the weight sharing structure in CNNs dramatically limits the number of possible hyperplane arrangements. This also explains the efficiency and remarkable generalization performance of CNNs in practice.
Training complexity analysis: Here, we calculate the computational complexity to globally solve the convex program in (8). Note that (8) is a convex optimization problem with 2dm_1MP_1P_2 variables and 2n(m_1+1)MP_1P_2 constraints. Therefore, due to the upper bounds in (24) and (25) of Appendix A.8, the convex program (8) can be globally optimized by a standard interior-point solver with computational complexity O(d^3 m_1^3 2^{3(m_1+1)} n^{3(m_1+1)r}), which is polynomial-time in terms of n and d. The analysis in this section can be recursively extended to arbitrarily deep parallel networks. First notice that if we apply the same approach to obtain an upper bound on P_3, then due to the multiplicative pattern in (25), we obtain P_3 ≤ P̃_3 (2P_2)^{m_2} ≤ O(n^{m_2 m_1 r}), where P̃_3 denotes the number of arrangements for a fixed effective data matrix of the third layer. In a similar manner, the number of hyperplane arrangements in the l-th layer is upper bounded as P_l ≤ O(n^{r Π_{j=1}^{l-1} m_j}), which is polynomial in both n and d for a fixed data rank r and fixed layer widths {m_j}_{j=1}^{l-1}.

A.11 EXTENSION TO VECTOR OUTPUTS

In this section, we extend the analysis to parallel networks with multiple outputs, where the label matrix is defined as Y ∈ R^{n×C} provided that there exist C classes/outputs. Then, the primal non-convex training problem is

p*_L = min_θ L( Σ_{k=1}^K ((X W_{1k})_+ … W_{(L-1)k})_+ W_{Lk}, Y ) + β Σ_{k=1}^K ( Σ_{j_1,…,j_L} ‖w_{1kj_1}‖_2^2 Π_{l=2}^{L} w_{lkj_{l-1}j_l}^2 )^{1/2},

and the corresponding dual takes the form

max_V -L*(V) s.t. ‖V^T ((X W_{1k})_+ … W_{(L-1)k})_+‖_F ≤ β, ∀k ∈ [K],

where the corresponding Fenchel conjugate function is L*(V) := max_Z trace(Z^T V) - L(Z, Y). Notice that above we have a dual matrix V instead of the dual vector in the scalar output case. More importantly, here we have a Frobenius norm in the dual constraint, unlike the scalar output case. Therefore, the vector-output case is slightly more challenging than the scalar-output case and yields a different regularization function in the equivalent convex program. The rest of the derivation directly follows the steps in Section A.4 and Sahiner et al. (2021b).



We analyze scalar outputs; however, our derivations extend to vector outputs, as shown in Appendix A.11. All proofs are presented in the supplementary file. We present the details in Appendix A.7. Details on the experiments can be found in Appendix A.1.



Figure 1: Decision boundaries of 2-layer and 3-layer ReLU networks that are globally optimized with weight decay (WD) and path regularization (PR). Here, our convex training approach in (c) successfully learns the underlying spiral pattern for each class while the previously studied convex models in (a) and (b) fail (see Appendix A.1 for details).

Figure 2: (Left): Parallel ReLU network in (1) with K sub-networks and three layers (L = 3) (Right): Path regularization for a three-layer network.

Figure 3: Illustration of possible hyperplane arrangements that determine the diagonal matrices D_i. Here, we have three samples in two dimensions and we want to separate these samples with a linear classifier. D_i basically encodes on which side of the linear classifier the samples lie.

Figure 4: Verification of Theorem 2 and Remark 2. We train a parallel network using the convex program in Theorem 1 with ℓ2 loss on a toy dataset with n = 15, β = 0.1, m_2 = 1, and the low-rank approximation r = d/2. To obtain a low-rank model, we first sample a data matrix from a standard Gaussian distribution and then set σ_{r+1} = … = σ_d = 1.

Figure 5: Accuracy of a three-layer architecture trained using the non-convex formulation (4) and the proposed convex program (8), where we use (a) CIFAR-10 with (n, d, m_1, m_2, K, β, batch size) = (5×10^4, 3072, 100, 1, 40, 10^{-3}, 10^3) and (b) Fashion-MNIST with (n, d, m_1, m_2, K, β, batch size) = (6×10^4, 784, 100, 1, 40, 10^{-3}, 10^3). We note that the convex model is trained using (a) SGD and (b) Adam.

Figure 6: Test accuracy for convex layerwise training, where each stage is our three-layer convex formulation in Theorem 1. Here, we train three-layer neural networks sequentially using convex optimization to build deep networks.


Figure 7: The normalized singular values of the data matrix for the MNIST and CIFAR-10 datasets. As illustrated in both figures, the singular values follow an exponentially decaying trend, indicating an effective low rank structure.


Supplementary Material

In this section, we provide new numerical results and detailed information about our experiments in the main paper. Decision boundary plots in Figure 1: In order to visualize the capabilities of our convex training approach, we perform an experiment on the spiral dataset, which is known to be challenging for 2-layer networks, while 3-layer networks can readily interpolate the training data (i.e., exactly fit the training labels) (Carter & Shan). As the baselines of our analysis, we include the two-layer convex training approach in Pilanci & Ergen (2020) and the recently introduced three-layer convex training approach (with weight decay regularization and several architectural and parametric assumptions) in Ergen & Pilanci (2021c). As our experimental setup, we consider a binary classification task with y ∈ {+1, -1}^n and squared loss. We choose (n, m_1, m_2, P_1, P_2, β) = (30, 5, 1, 11, 11, 1e-4), and for Pilanci & Ergen (2020) we use P = 50 neurons/hyperplane arrangements. We also use CVXPY with the MOSEK solver (Grant & Boyd, 2014; Diamond & Boyd, 2016; Agrawal et al., 2018) to globally solve the convex programs. As demonstrated in Figure 1, the baseline methods, especially Ergen & Pilanci (2021c), fit a function that is significantly different from the underlying spiral data distribution. This clearly shows that since Pilanci & Ergen (2020) is restricted to two-layer networks and Ergen & Pilanci (2021c) relies on multiple assumptions, i.e., unit Frobenius norm constraints on the layer weights (‖W_{lk}‖_F ≤ 1, ∀l ∈ [L-2]) and last hidden layer weights that cannot be matrices (w_{(L-1)k} ∈ R^{m_{L-2}}), both baseline approaches fail to reflect the true expressive power of deep networks (L > 2).
On the other hand, our convex training approach for path regularized networks fits a model that successfully describes the underlying data distribution for this challenging task. Additional experiments: We also conduct experiments on several datasets available in the UCI Machine Learning Repository (Dua & Graff, 2017), where we particularly selected the datasets from Fernández-Delgado et al. (2014) such that n ≤ 500. For these datasets, we consider a conventional binary classification framework with (m_1, m_2, K, β) = (100, 1, 40, 0.5) and compare the test accuracies of non-convex architectures trained with SGD and Adam with their convex counterparts in (8). For these experiments, we use an 80%-20% splitting ratio for the training and test sets. Furthermore, we train each algorithm long enough to reach a training accuracy of one. As shown in Table 3, our convex approach achieves higher or the same test accuracy compared to the standard non-convex training approach for most of the datasets (precisely 20 and 19 out of 21 datasets for SGD and Adam, respectively). We also note that for this experiment, we used the unconstrained form in (11) with the approximate version in Remark 3.3 of Pilanci & Ergen (2020).
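For reference, a two-class spiral of the kind used in Figure 1 can be generated as follows (a common construction; the helper name and exact parameters are illustrative and may differ from our experimental dataset):

```python
import numpy as np

def make_spiral(n_per_class=15, noise=0.05, turns=1.5, seed=0):
    """Two interleaved spirals with labels in {+1, -1}."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.25, turns * 2 * np.pi, n_per_class)
    X, y = [], []
    for label, phase in [(+1, 0.0), (-1, np.pi)]:
        x1 = t * np.cos(t + phase) + noise * rng.standard_normal(n_per_class)
        x2 = t * np.sin(t + phase) + noise * rng.standard_normal(n_per_class)
        X.append(np.stack([x1, x2], axis=1))
        y.append(label * np.ones(n_per_class))
    return np.concatenate(X), np.concatenate(y)

X, y = make_spiral()  # 30 samples in R^2, matching n = 30 above
```

The two arms are phase-shifted copies of the same curve, so a classifier must learn a genuinely nonlinear boundary, which is what makes this dataset hard for two-layer models.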

