VECTOR-OUTPUT RELU NEURAL NETWORK PROBLEMS ARE COPOSITIVE PROGRAMS: CONVEX ANALYSIS OF TWO LAYER NETWORKS AND POLYNOMIAL-TIME ALGORITHMS

Abstract

We describe the convex semi-infinite dual of the two-layer vector-output ReLU neural network training problem. This semi-infinite dual admits a finite dimensional representation, but its support is over a convex set which is difficult to characterize. In particular, we demonstrate that the non-convex neural network training problem is equivalent to a finite-dimensional convex copositive program. Our work is the first to identify this strong connection between the global optima of neural networks and those of copositive programs. We thus demonstrate how neural networks implicitly attempt to solve copositive programs via semi-nonnegative matrix factorization, and draw key insights from this formulation. We describe the first algorithms for provably finding the global minimum of the vector output neural network training problem, which are polynomial in the number of samples for a fixed data rank, yet exponential in the dimension. However, in the case of convolutional architectures, the computational complexity is exponential in only the filter size and polynomial in all other parameters. We describe the circumstances in which we can find the global optimum of this neural network training problem exactly with soft-thresholded SVD, and provide a copositive relaxation which is guaranteed to be exact for certain classes of problems, and which corresponds with the solution of Stochastic Gradient Descent in practice.

1. INTRODUCTION

In this paper, we analyze vector-output two-layer ReLU neural networks from an optimization perspective. These networks, while simple, are the building blocks of deep networks which have been found to perform tremendously well for a variety of tasks. We find that vector-output networks regularized with standard weight decay have a convex semi-infinite strong dual: a convex program with infinitely many constraints. However, this strong dual has a finite parameterization, though expressing this parameterization is non-trivial. In particular, we find that expressing a vector-output neural network as a convex program requires taking the convex hull of completely positive matrices. Thus, we find an intimate, novel connection between neural network training and copositive programs, i.e. programs over the set of completely positive matrices (Anjos & Lasserre, 2011). We describe algorithms which can be used to find the global minimum of the neural network training problem in polynomial time for data matrices of fixed rank, a condition which holds for convolutional architectures. We also demonstrate that under certain conditions we can provably find the optimal solution to the neural network training problem using soft-thresholded Singular Value Decomposition (SVD). In the general case, we introduce a relaxation to parameterize the neural network training problem, which in practice we find to be tight in many circumstances.

1.1. RELATED WORK

Our analysis focuses on the optima of finite-width neural networks. This approach contrasts with approaches which analyze infinite-width neural networks, such as the Neural Tangent Kernel (Jacot et al., 2018). Despite advancements in this direction, infinite-width neural networks do not exactly correspond to their finite-width counterparts, and thus this method of analysis is insufficient for fully explaining their success (Arora et al., 2019). Other works attempt to optimize neural networks under assumptions on the data distribution. Of particular interest is (Ge et al., 2018), which demonstrates that a polynomial number of samples generated from a planted neural network model is sufficient for extracting its parameters using tensor methods, assuming the inputs are drawn from a symmetric distribution. If the input distribution to a simple convolutional neural network with one filter is Gaussian, it has also been shown that gradient descent can find the global optimum in polynomial time (Brutzkus & Globerson, 2017). In contrast to these works, we seek to find general principles for learning two-layer ReLU networks, regardless of the data distribution and without planted model assumptions. Another line of work aims to understand the success of neural networks via implicit regularization, which analyzes how models trained with Stochastic Gradient Descent (SGD) find solutions which generalize well, even without explicit control of the optimization objective (Gunasekar et al., 2017; Neyshabur et al., 2014). In contrast, we consider the setting of explicit regularization, often used in practice in the form of weight decay, which regularizes the sum of squared norms of the network weights with a single regularization parameter β and can be critical for neural network performance (Golatkar et al., 2019).
Our approach of analyzing finite-width neural networks with a fixed training dataset has been explored for networks with a scalar output (Pilanci & Ergen, 2020; Ergen & Pilanci, 2020a;d). In fact, our work here can be considered a generalization of these results. We consider a ReLU-activation two-layer network f : R^d → R^c with m neurons:

f(x) = Σ_{j=1}^m (x^T u_j)_+ v_j    (1)

where the function (·)_+ = max(0, ·) denotes the ReLU activation, {u_j ∈ R^d}_{j=1}^m are the first-layer weights of the network, and {v_j ∈ R^c}_{j=1}^m are the second-layer weights. In the scalar-output case, the weights v_j are scalars, i.e. c = 1. Pilanci & Ergen (2020) find that the neural network training problem in this setting corresponds to a finite-dimensional convex program. However, the setting of scalar-output networks is limited. In particular, this setting cannot account for tasks such as multi-class classification or multi-dimensional regression, which are among the most common uses of neural networks. In contrast, the vector-output setting is quite general, and even greedily training and stacking such shallow vector-output networks can match or even exceed the performance of deeper networks on large datasets for classification tasks (Belilovsky et al., 2019). We find that extending the scalar case to the vector-output case is an exceedingly non-trivial task, which generates novel insights. Thus, generalizing the results of Pilanci & Ergen (2020) is an important step toward a more complete knowledge of the behavior of neural networks in practice.

Certain works have also considered technical problems which arise in our analysis, though their applications are entirely different. Among these is analysis of cone-constrained PCA, as explored by Deshpande et al. (2014) and Asteris et al. (2014). They consider the following optimization problem:

max_u u^T R u  s.t.  Xu ≥ 0, ‖u‖_2 = 1    (2)

This problem is in general considered NP-hard. Asteris et al. (2014) provide an exponential-time algorithm which runs in O(n^d) time to find the exact solution to (2), where X ∈ R^{n×d} and R ∈ S^d is a symmetric matrix. We leverage this result to show that the optimal value of the vector-output neural network training problem can be found in the worst case in exponential time with respect to r := rank(X), while in the case of a fixed-rank data matrix our algorithm is polynomial-time. In particular, convolutional networks with fixed filter sizes (e.g., 3 × 3 × m convolutional kernels) correspond to the fixed-rank data case (e.g., r = 9). In search of a polynomial-time approximation to (2), Deshpande et al. (2014) evaluate a relaxation of the above problem, given as:

max_U ⟨R, U⟩  s.t.  XUX^T ≥ 0, tr(U) = 1, U ⪰ 0    (3)

While the relaxation is not tight in all cases, the authors find that in practice it works quite well for approximating the solution to the original optimization problem. This relaxation, in particular, corresponds to what we call a copositive relaxation, because it consists of a relaxation of the set C_PCA = {uu^T : ‖u‖_2 = 1, Xu ≥ 0}. When X = I and the norm constraint is removed, C_PCA is the set of completely positive matrices (Dür, 2010). Optimizing over the set of completely positive matrices is NP-hard, as is optimizing over its convex hull

C := conv{uu^T : Xu ≥ 0}

Thus, optimizing over C is a convex optimization problem which is nevertheless NP-hard. Various relaxations to C have been proposed, such as the copositive relaxation used by Deshpande et al. (2014) above:

Ĉ := {U : U ⪰ 0, XUX^T ≥ 0}

In fact, this relaxation is tight given that u ∈ R^d and d ≤ 4 (Burer, 2015; Kogan & Berman, 1993). However, C ⊂ Ĉ for d ≥ 5, so the copositive relaxation provides a lower bound in the general case. These theoretical results prove insightful for understanding the neural network training objective.

1.2. CONTRIBUTIONS

• We find the semi-infinite convex strong dual for the vector-output two-layer ReLU neural network training problem, and prove that it has a finite-dimensional exact convex optimization representation.
• We establish a new connection between vector-output neural networks, copositive programs, and cone-constrained PCA problems, yielding new insights into the nature of vector-output neural network training which extend the results of the scalar-output case.
• We provide methods that globally solve the vector-output neural network training problem in polynomial time for data matrices of a fixed rank, but for the full-rank case, the complexity is necessarily exponential in d assuming P ≠ NP.
• We provide conditions on the training data and labels under which we can find a closed-form expression for the optimal weights of a vector-output ReLU neural network using soft-thresholded SVD.
• We propose a copositive relaxation to establish a heuristic for solving the neural network training problem. This copositive relaxation is often tight in practice.

2. PRELIMINARIES

In this work, we consider fitting labels Y ∈ R^{n×c} from inputs X ∈ R^{n×d} with a two-layer neural network with ReLU activation and m neurons in the hidden layer. This network is trained with weight-decay regularization on all of its weights, with associated parameter β > 0. For some general loss function ℓ(f(X), Y), this gives us the non-convex primal optimization problem

p* = min_{u_j ∈ R^d, v_j ∈ R^c} (1/2) ℓ(f(X), Y) + (β/2) Σ_{j=1}^m (‖u_j‖_2^2 + ‖v_j‖_2^2)    (4)

In the simplest case, with a fully-connected network trained with squared loss (Appendix A.6 contains extensions to general convex loss functions), this becomes:

p* = min_{u_j ∈ R^d, v_j ∈ R^c} (1/2) ‖Σ_{j=1}^m (X u_j)_+ v_j^T - Y‖_F^2 + (β/2) Σ_{j=1}^m (‖u_j‖_2^2 + ‖v_j‖_2^2)    (5)

However, alternative models can be considered. In particular, for example, Ergen & Pilanci (2020d) consider two-layer CNNs with global average pooling, for which we can define the patch matrices {X_k}_{k=1}^K, which define the patches that individual convolutions operate upon. Then, the vector-output neural network training problem with global average pooling becomes

p*_conv = min_{u_j ∈ R^d, v_j ∈ R^c} (1/2) ‖Σ_{k=1}^K Σ_{j=1}^m (X_k u_j)_+ v_j^T - Y‖_F^2 + (β/2) Σ_{j=1}^m (‖u_j‖_2^2 + ‖v_j‖_2^2)    (6)

We will show that in this convolutional setting, because the rank of the set of patches M := [X_1, X_2, ..., X_K] cannot exceed the filter size of the convolutions, there exists an algorithm which is polynomial in all problem dimensions to find the global optimum of this problem. We note that such matrices typically exhibit rapid singular value decay due to spatial correlations, which may also motivate replacing M with an approximation of much smaller rank. In the following section, we will demonstrate how the vector-output neural network problem has a convex semi-infinite strong dual. To understand how to parameterize this semi-infinite dual in a finite fashion, we must introduce the concept of hyperplane arrangements.
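Before turning to hyperplane arrangements, it is worth noting that the regularized primal objective (5) is simple to evaluate directly. The following NumPy sketch (function and variable names are ours, not from the paper) computes the squared loss plus the weight-decay penalty for a two-layer ReLU network:

```python
import numpy as np

def relu_network_objective(X, Y, U, V, beta):
    """Non-convex training objective (5): squared loss plus weight decay.

    X: (n, d) inputs, Y: (n, c) labels,
    U: (d, m) first-layer weights (columns u_j),
    V: (m, c) second-layer weights (rows v_j^T).
    """
    preds = np.maximum(X @ U, 0.0) @ V        # sum_j (X u_j)_+ v_j^T
    loss = 0.5 * np.linalg.norm(preds - Y) ** 2
    reg = 0.5 * beta * (np.sum(U ** 2) + np.sum(V ** 2))
    return loss + reg

# toy usage on random data
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Y = rng.standard_normal((5, 2))
U = rng.standard_normal((3, 4))
V = rng.standard_normal((4, 2))
obj = relu_network_objective(X, Y, U, V, beta=1e-2)
```

At zero weights the objective reduces to (1/2)‖Y‖_F^2, which gives a quick sanity check on the implementation.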
We consider the set of diagonal matrices

D := {diag(1[Xu ≥ 0]) : ‖u‖_2 ≤ 1}

This is a finite set of diagonal matrices, dependent on the data matrix X, which indicates the set of possible activation patterns for the ReLU non-linearity, where a value of 1 indicates that the neuron is active, while 0 indicates that the neuron is inactive. In particular, we can enumerate the set of sign patterns as D = {D_i}_{i=1}^P, where P depends on X but is in general bounded by

P ≤ 2r (e(n-1)/r)^r

for r := rank(X) (Pilanci & Ergen, 2020; Stanley et al., 2004). Note that for a fixed rank r, such as in the convolutional case above, P is polynomial in n. Using these sign patterns, we can completely characterize the range space of the first layer after the ReLU:

{(Xu)_+ : ‖u‖_2 ≤ 1} = ∪_{i∈[P]} {D_i X u : ‖u‖_2 ≤ 1, (2D_i - I)Xu ≥ 0}

We also introduce a class of data matrices X for which the analysis of scalar-output neural networks simplifies greatly, as shown in (Ergen & Pilanci, 2020b). These matrices are called spike-free matrices. In particular, a matrix X is called spike-free if it holds that

{(Xu)_+ : ‖u‖_2 ≤ 1} = {Xu : ‖u‖_2 ≤ 1} ∩ R^n_+    (7)

When X is spike-free, the set of sign patterns D reduces to a single sign pattern, D = {I}, because of the identity in (7). The set of spike-free matrices includes (but is not limited to) diagonal matrices and whitened matrices for which n ≤ d, such as the output of Zero-phase Component Analysis (ZCA) whitening. Whitening the data matrix has been shown to improve the performance of neural networks, even in deeper settings where the whitening transformation is applied to batches of data at each layer (Huang et al., 2018). We will see that spike-free matrices provide polynomial-time algorithms for finding the global optimum of the neural network training problem in both n and d (Ergen & Pilanci, 2020b), though the same does not hold for vector-output networks.
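As a rough illustration of the arrangement set D, one can sample random directions u and record the induced activation patterns 1[Xu ≥ 0]. Exact enumeration requires hyperplane-arrangement algorithms, so the sketch below (our own construction, not from the paper) only recovers the subset of D hit by the samples:

```python
import numpy as np

def sample_sign_patterns(X, num_samples=2000, seed=0):
    """Collect distinct activation patterns diag(1[Xu >= 0]) over random u.

    This is a sampling heuristic: it returns a subset of the full
    arrangement set D, whose size is bounded by 2r(e(n-1)/r)^r.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    patterns = set()
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        patterns.add(tuple((X @ u >= 0).astype(int)))
    return [np.diag(np.array(p)) for p in sorted(patterns)]

# toy usage: three data points in R^2 give an arrangement of three lines
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Ds = sample_sign_patterns(X)
```

For this toy X, the three hyperplanes through the origin partition R^2 into six sectors, so the sampler should recover six distinct patterns.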

2.1. WARM-UP: SCALAR-OUTPUT NETWORKS

We first present strong duality results for the scalar-output case, i.e. the case where c = 1.

Theorem (Pilanci & Ergen, 2020) There exists an m* ≤ n + 1 such that if m ≥ m*, for all β > 0, the neural network training problem (5) has a convex semi-infinite strong dual, given by

p* = d* := max_{z : |z^T (Xu)_+| ≤ β ∀ ‖u‖_2 ≤ 1} -(1/2)‖z - y‖_2^2 + (1/2)‖y‖_2^2    (8)

Furthermore, the neural network training problem has a convex, finite-dimensional strong bi-dual, given by

p* = min_{(2D_i - I)Xw_i ≥ 0, (2D_i - I)Xv_i ≥ 0 ∀i∈[P]} (1/2)‖Σ_{i=1}^P D_i X(w_i - v_i) - y‖_2^2 + β Σ_{i=1}^P (‖w_i‖_2 + ‖v_i‖_2)    (9)

This is a convex program with 2dP variables and 2nP linear inequalities. Solving this problem with standard interior-point solvers thus has a complexity of O(d^3 r^3 (n/d)^{3r}), which is exponential in r, but for a fixed rank r is polynomial in n. In the case of a spike-free X, however, the dual problem simplifies to a single sign-pattern constraint D_1 = I. Then the convex strong bi-dual becomes (Ergen & Pilanci, 2020b)

p* = min_{Xw ≥ 0, Xv ≥ 0} (1/2)‖X(w - v) - y‖_2^2 + β(‖w‖_2 + ‖v‖_2)    (10)

This convex problem has a much simpler form, with only 2n linear inequality constraints and 2d variables, and can therefore be solved with complexity O(nd^2). We will see that the results for scalar-output ReLU neural networks are a specific case of the vector-output case.

3. STRONG DUALITY

3.1. CONVEX SEMI-INFINITE DUALITY

Theorem 1 There exists an m* ≤ nc + 1 such that if m ≥ m*, for all β > 0, the neural network training problem (5) has a convex semi-infinite strong dual, given by

p* = d* := max_{Z : ‖Z^T (Xu)_+‖_2 ≤ β ∀ ‖u‖_2 ≤ 1} -(1/2)‖Z - Y‖_F^2 + (1/2)‖Y‖_F^2    (11)

Furthermore, the neural network training problem has a convex, finite-dimensional strong bi-dual, given by

p* = min_{V_i ∈ K_i ∀i∈[P]} (1/2)‖Σ_{i=1}^P D_i X V_i - Y‖_F^2 + β Σ_{i=1}^P ‖V_i‖_*    (12)

for convex sets K_i, where

K_i := conv{u g^T : (2D_i - I)Xu ≥ 0, ‖g‖_2 ≤ 1}

The strong dual given in (11) is convex, albeit with infinitely many constraints. In contrast, (12) is a convex problem with finitely many constraints. This convex model learns a sparse set of locally linear models V_i which are constrained to lie in convex sets, for which group sparsity and low-rankness over hyperplane arrangements is induced by the sum-of-nuclear-norms penalty. The emergence of the nuclear norm penalty is particularly interesting, since similar norms have also been used for rank minimization problems (Candès & Tao, 2010; Recht et al., 2010), proposed as implicit regularizers for matrix factorization models (Gunasekar et al., 2017), and draw similarities to nuclear norm regularization in multitask learning (Argyriou et al., 2008; Abernethy et al., 2009) and trace Lasso (Grave et al., 2011). We note the similarity of this result to that of Pilanci & Ergen (2020), whose formulation is a special case of this result with c = 1, where each K_i reduces to

K_i = {u : (2D_i - I)Xu ≥ 0} ∪ {-u : (2D_i - I)Xu ≥ 0}

from which we can obtain the convex program presented by Pilanci & Ergen (2020). Further, this result extends to CNNs with global average pooling, which is discussed in Appendix A.3.2.

Remark 1.1 It is interesting to observe that the convex program (12) can be interpreted as a piecewise low-rank model that is partitioned according to the set of hyperplane arrangements of the data matrix.
In other words, a two-layer ReLU network with vector output is precisely a linear learner over the features [D_1 X, ..., D_P X], where convex constraints and the group nuclear norm regularization Σ_{i=1}^P ‖V_i‖_* are applied to the linear model weights. In the case of CNNs, the piecewise low-rank model is over the smaller-dimensional patch matrices {X_k}_{k=1}^K, which result in significantly fewer hyperplane arrangements, and therefore fewer local low-rank models.
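The "linear learner over the features [D_1 X, ..., D_P X]" view can be checked numerically. The sketch below (our own helper names) stacks the masked data matrices column-wise and verifies that applying block weights V_1, ..., V_P is just an ordinary linear model over the stacked features:

```python
import numpy as np

def arrangement_features(X, Ds):
    """Stack the masked data matrices [D_1 X, ..., D_P X] column-wise,
    giving the feature map under which the convex program is a linear
    model with block weights V_1, ..., V_P."""
    return np.hstack([D @ X for D in Ds])

# toy check: block-wise predictions sum_i D_i X V_i equal a single
# linear model over the stacked feature matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
Ds = [np.diag((X @ rng.standard_normal(2) >= 0).astype(float)) for _ in range(3)]
F = arrangement_features(X, Ds)                   # shape (5, 6)
Vs = [rng.standard_normal((2, 4)) for _ in range(3)]
pred_blocks = sum(D @ X @ V for D, V in zip(Ds, Vs))
pred_linear = F @ np.vstack(Vs)                   # identical prediction
```

The equivalence holds exactly because stacking features and stacking weight blocks commutes with matrix multiplication.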

3.2. PROVABLY SOLVING THE NEURAL NETWORK TRAINING PROBLEM

In this section, we present a procedure for minimizing the convex program (12) for general output dimension c. This procedure relies on Algorithm 5 for cone-constrained PCA from (Asteris et al., 2014), and the Frank-Wolfe algorithm for constrained convex optimization (Frank et al., 1956). Unlike SGD, which is a heuristic method applied to a non-convex training problem, this approach is built upon results from convex optimization and provably finds the global minimum of the objective. In particular, we can solve the problem in epigraph form,

p* = min_{t ≥ 0} min_{V_i ∈ K_i ∀i∈[P], Σ_{i=1}^P ‖V_i‖_* ≤ t} (1/2)‖Σ_{i=1}^P D_i X V_i - Y‖_F^2 + βt    (14)

where we can perform bisection over t in an outer loop to determine the overall optimal value of (12). Then, we have the following algorithm to solve the inner minimization problem of (14):

Algorithm 1:
1. Initialize {V_i^(0)}_{i=1}^P = 0.

2. For steps k = 0, 1, 2, ...:

(a) For each i ∈ [P], solve the following subproblem:

s_i^(k) = max_{‖u‖_2 ≤ 1, ‖g‖_2 ≤ 1, (2D_i - I)Xu ≥ 0} ⟨D_i X u g^T, Y - Σ_{j=1}^P D_j X V_j^(k)⟩

and define the pairs {(u_i, g_i)}_{i=1}^P to be the maximizers of the above subproblems. This is a form of semi-nonnegative matrix factorization (semi-NMF) on the residual at step k (Ding et al., 2008). It can be solved via cone-constrained PCA in O(n^r) time, where r = rank(X).
(b) For the semi-NMF factorization obtaining the largest objective value, i* := argmax_i s_i^(k), form M_{i*}^(k) = u_{i*} g_{i*}^T. For all other i ≠ i*, simply let M_i^(k) = 0.
(c) For step size α^(k) ∈ (0, 1), update

V_i^(k+1) = (1 - α^(k)) V_i^(k) + t α^(k) M_i^(k)

The derivations for the method and complexity of Algorithm 1 are found in Appendix A.4. We have thus described a Frank-Wolfe algorithm which provably minimizes the convex dual problem, where each step requires a semi-NMF operation, which can be performed in O(n^r) time.
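To make Algorithm 1 concrete in the simplest setting (a single sign pattern D_1 = I, as in the spike-free case), the sketch below, which is entirely our own simplification, runs Frank-Wolfe where the exact cone-constrained PCA oracle of step (a) is replaced by a crude random search over feasible directions u. For a fixed u, the optimal g is available in closed form: g ∝ R^T X u for residual R.

```python
import numpy as np

def frank_wolfe_spikefree(X, Y, t, steps=200, n_dirs=500, seed=0):
    """Frank-Wolfe for min 0.5*||X V - Y||_F^2 over
    V in t * conv{u g^T : Xu >= 0, ||u||_2 <= 1, ||g||_2 <= 1},
    with the cone-constrained PCA oracle replaced by random search
    over feasible directions (a stand-in for the exact O(n^r) oracle).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    c = Y.shape[1]
    # sample candidate feasible unit directions u with Xu >= 0
    cands, attempts = [], 0
    while len(cands) < n_dirs and attempts < 100 * n_dirs:
        attempts += 1
        u = rng.standard_normal(d)
        for s in (u, -u):
            if (X @ s >= 0).all():
                cands.append(s / np.linalg.norm(s))
    if not cands:
        raise ValueError("found no feasible directions; the cone may be trivial")
    U = np.array(cands)                                   # (num_cands, d)
    V = np.zeros((d, c))
    for k in range(steps):
        R = Y - X @ V                                     # residual at step k
        scores = np.linalg.norm(R.T @ (X @ U.T), axis=0)  # ||R^T X u|| per candidate
        u = U[int(np.argmax(scores))]
        g = R.T @ (X @ u)                                 # optimal g for this u,
        g /= max(np.linalg.norm(g), 1e-12)                # normalized to the unit ball
        M = np.outer(u, g)                                # Frank-Wolfe atom u g^T
        alpha = 2.0 / (k + 2.0)
        V = (1.0 - alpha) * V + alpha * t * M             # convex-combination update
    return V

# toy usage: labels realizable by a single feasible atom
rng = np.random.default_rng(1)
X = np.abs(rng.standard_normal((6, 3)))   # nonnegative X keeps {u : Xu >= 0} wide
u0 = np.ones(3) / np.sqrt(3.0)
Y = np.outer(X @ u0, np.array([1.0, 0.0]))
V = frank_wolfe_spikefree(X, Y, t=1.0)
```

Because the labels here are generated by one feasible atom, the residual should shrink well below its initial value, illustrating the per-step structure of the algorithm even with an approximate oracle.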

3.3. SPIKE-FREE DATA MATRICES AND CLOSED-FORM SOLUTIONS

As discussed in Section 2, if X is spike-free, the set of sign patterns is reduced to the single pattern D_1 = I. Then, the convex program (12) becomes

min_{V ∈ conv{u g^T : Xu ≥ 0, ‖g‖_2 ≤ 1}} (1/2)‖XV - Y‖_F^2 + β‖V‖_*    (15)

This problem can also be solved with Algorithm 1. However, the asymptotic complexity of this algorithm is unchanged, due to the cone-constrained PCA step. If the constraint on V were removed, (15) would be identical to optimizing a linear-activation network. However, the additional cone constraint on V allows for a more complex representation, which demonstrates that even in the spike-free case, a ReLU-activation network is quite different from a linear-activation network. Recalling that whitened data matrices with n ≤ d are spike-free, for a further simplified class of data and label matrices, we can find a closed-form expression for the optimal weights.

Theorem 2 Consider a whitened data matrix X ∈ R^{n×d} with n ≤ d, and labels Y such that X^T Y has the SVD X^T Y = Σ_{i=1}^c σ_i a_i b_i^T. If the left singular vectors of X^T Y satisfy Xa_i ≥ 0 ∀i ∈ {i : σ_i > β}, there exists a closed-form solution for the optimal V* of problem (15), given by

V* = Σ_{i=1}^c (σ_i - β)_+ a_i b_i^T    (16)

The resulting model is a soft-thresholded SVD of X^T Y, which arises as the solution of maximum-margin matrix factorization (Srebro et al., 2005). The scenario that all of the left singular vectors of X^T Y satisfy the affine constraints Xa_i ≥ 0 ∀i occurs, for example, when all of the left singular vectors of Y are nonnegative, which is the case when Y is a one-hot-encoded matrix. In this scenario, where the left singular vectors of X^T Y satisfy Xa_i ≥ 0 ∀i ∈ {i : σ_i > β}, we note that the ReLU cone constraint on V* is not active, and therefore the solution of the ReLU-activation network training problem is identical to that of the linear-activation network.
This linear-activation setting has been well-studied, such as in matrix factorization models by (Cabral et al., 2013; Li et al., 2017) , and in the context of implicit bias of dropout (Mianjy et al., 2018; Mianjy & Arora, 2019) . This theorem thus provides a setting in which ReLU-activation and linear-activation networks perform identically.
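The closed form of Theorem 2 is straightforward to compute; a NumPy sketch (names are our own) is below. Note that the toy usage only exercises the formula itself, not the theorem's cone condition, which would need to be verified separately for a given X and Y. As a sanity check, when β = 0 and X has orthonormal rows (so XX^T = I), the solution V* = X^T Y interpolates the labels exactly.

```python
import numpy as np

def soft_thresholded_svd(X, Y, beta):
    """Candidate closed-form solution (16): soft-threshold the singular
    values of X^T Y. Valid under the conditions of Theorem 2 (whitened X
    with n <= d, and X a_i >= 0 for every direction with sigma_i > beta)."""
    A, s, Bt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return (A * np.maximum(s - beta, 0.0)) @ Bt   # scale columns of A, recombine

# toy usage: whitened data (orthonormal rows, n <= d) and one-hot-style labels
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 4)))  # (8, 4) with orthonormal columns
X = Q.T                                           # (4, 8): X X^T = I_4
Y = np.eye(4)[:, :3]
V = soft_thresholded_svd(X, Y, beta=0.1)
```

The singular values of the returned V are exactly the soft-thresholded singular values (σ_i - β)_+ of X^T Y.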

4.1. AN EQUIVALENT COPOSITIVE PROGRAM

We now present an alternative representation of the neural network training problem with squared loss, which has ties to copositive programming.

Theorem 3 For all β > 0, the neural network training problem (5) has a convex strong dual, given by

p* = min_{U_i ∈ C_i ∀i∈[P]} (1/2) tr(Y^T (I + 2 Σ_{i=1}^P (D_i X) U_i (D_i X)^T)^{-1} Y) + (β/2) Σ_{i=1}^P tr(U_i)    (17)

for convex sets C_i, given by

C_i := conv{u u^T : (2D_i - I)Xu ≥ 0}

This is a minimization problem with a convex objective over P convex, completely positive cones, i.e. a copositive program, which is NP-hard. There exists a cutting-plane algorithm which solves this problem in O(n^r) time, which is polynomial in n for data matrices of rank r (see Appendix A.5). This formulation provides a framework for viewing ReLU neural networks as implicit copositive programs, and we can find conditions under which certain relaxations provide optimal solutions.

4.2. A COPOSITIVE RELAXATION

We consider the copositive relaxation of the sets C_i from (17). We denote this set

Ĉ_i := {U : U ⪰ 0, (2D_i - I)XUX^T(2D_i - I) ≥ 0}

In general, C_i ⊆ Ĉ_i, with equality when d ≤ 4 (Kogan & Berman, 1993; Dickinson, 2013). We define the relaxed program as

d*_cp := min_{U_i ∈ Ĉ_i ∀i∈[P]} (1/2) tr(Y^T (I + 2 Σ_{i=1}^P (D_i X) U_i (D_i X)^T)^{-1} Y) + (β/2) Σ_{i=1}^P tr(U_i)    (19)

Because of the enumeration over sign patterns, this relaxed program still has a complexity of O(n^r) to solve, and thus does not improve upon the asymptotic complexity presented in Section 3.

4.3. SPIKE-FREE DATA MATRICES

If X is restricted to be spike-free, the convex program (19) becomes

d*_cp := min_{U ⪰ 0, XUX^T ≥ 0} (1/2) tr(Y^T (I + 2XUX^T)^{-1} Y) + (β/2) tr(U)    (20)

With spike-free data matrices, the copositive relaxation presents a heuristic algorithm for the neural network training problem which is polynomial in both n and d. This contrasts with the exact formulations (12) and (17), for which the neural network training problem is exponential even for a spike-free X. The complexities are summarized below:

              | Scalar-output   | Vector-output (exact)  | Vector-output (relaxation)
Spike-free X  | O(nd^2)         | O(n^r)                 | O(n^2 d^4)
General X     | O((n/d)^{3r})   | O(n^r (n/d)^{3r})      | O((n/d)^{3r})
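The relaxed spike-free program above is a semidefinite program and would typically be handed to an off-the-shelf SDP solver. For illustration, the snippet below (helper names are our own) just evaluates the relaxed objective and checks membership in the relaxed cone {U : U ⪰ 0, XUX^T ≥ 0}:

```python
import numpy as np

def relaxed_objective(X, Y, U, beta):
    """Objective of the spike-free copositive relaxation:
    0.5 * tr(Y^T (I + 2 X U X^T)^{-1} Y) + (beta/2) * tr(U)."""
    n = X.shape[0]
    K = np.eye(n) + 2.0 * X @ U @ X.T
    return 0.5 * np.trace(Y.T @ np.linalg.solve(K, Y)) + 0.5 * beta * np.trace(U)

def in_relaxed_cone(X, U, tol=1e-8):
    """Membership check: U PSD and X U X^T elementwise nonnegative."""
    eigs = np.linalg.eigvalsh((U + U.T) / 2.0)
    return eigs.min() >= -tol and (X @ U @ X.T).min() >= -tol

# toy usage: a rank-one U = a a^T with X a >= 0 lies in the relaxed cone
rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((4, 3)))
Y = rng.standard_normal((4, 2))
a = np.abs(rng.standard_normal(3))
U = np.outer(a, a)
val = relaxed_objective(X, Y, U, beta=0.5)
```

At U = 0 the trace term reduces to (1/2)‖Y‖_F^2, which provides a simple check on the implementation.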

5. EXPERIMENTS

5.1. DOES SGD ALWAYS FIND THE GLOBAL OPTIMUM FOR NEURAL NETWORKS?

While SGD applied to the non-convex neural network training objective is a heuristic which works quite well in many cases, there may exist pathological cases where SGD fails to find the global minimum. Using Algorithm 1, we can now verify whether SGD finds the global optimum. In this experiment, we present one such case where SGD has trouble finding the optimal solution in certain circumstances. In particular, we generate random inputs X ∈ R^{25×2}, where the elements of X are drawn from an i.i.d. standard Gaussian distribution: X_{i,j} ~ N(0, 1). We then randomly initialize a data-generating neural network f with 100 hidden neurons and an output dimension of 5, and generate labels Y = f(X) ∈ R^{25×5} using this model. We then attempt to fit these labels using a neural network and squared loss, with β = 10^{-2}. We compare the results of training this network for 5 trials with 10 and 50 neurons to the global optimum found by Algorithm 1. In this circumstance, with 10 neurons, none of the realizations of SGD converge to the global optimum found by Algorithm 1, but with 50 neurons, the loss is nearly identical to that found by Algorithm 1.

5.2. MAXIMUM-MARGIN MATRIX FACTORIZATION

In this section, we evaluate the performance of the soft-thresholded SVD closed-form solution presented in Theorem 2. In order to evaluate this method, we take a subset of 3000 points from the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009). For each dataset, we first de-mean the data matrix X ∈ R^{3000×3072}, then whiten the data matrix using ZCA whitening. We seek to fit one-hot-encoded labels representing the class labels from these datasets. In Fig. 2, we observe that the soft-thresholded SVD method from Theorem 2 finds the same solution as SGD in far shorter time. Appendix A.1.4 contains further details of this experiment.

5.3. EFFECTIVENESS OF THE COPOSITIVE PROGRAM

In this section, we compare the objective values obtained by SGD, Algorithm 1, and the copositive program defined in (17). We use an artificially-generated spiral dataset, with X ∈ R^{60×2} and 3 classes (see Fig. 3(a) for an illustration). In this case, since d ≤ 4, we note that the copositive relaxation in (19) is tight. Across different values of β, we compare the solutions found by these three methods. As shown in Fig. 3, the copositive relaxation, the solution found by SGD, and the solution found by Algorithm 1 all coincide with the same loss across various values of β. This verifies our theoretical proofs of the equivalence of (5), (12), and (19).

6. CONCLUSION

We studied the vector-output ReLU neural network training problem, and designed the first algorithms for finding the global optimum of this problem, which are polynomial-time in the number of samples for a fixed data rank. We found novel connections between this vector-output ReLU neural network problem and a variety of other problems, including semi-NMF, cone-constrained PCA, soft-thresholded SVD, and copositive programming. Of particular interest is extending these results to deeper networks, which would further explain the performance of neural networks as they are often used in practice. One such method to extend the results in this paper to deeper networks is to greedily train and stack two-layer networks to create one deeper network, which has been shown to mimic the performance of deep networks trained end-to-end. Some preliminary results for convex program equivalents of deeper training problems are presented under whitened input data assumptions in (Ergen & Pilanci, 2020c). Another interesting research direction is investigating efficient relaxations of our vector-output convex programs for larger-scale simulations, which have been studied in (Bartan & Pilanci, 2019; Ergen & Pilanci, 2019b;a; d'Aspremont & Pilanci, 2020). Furthermore, the landscapes of vector-output neural networks and the dynamics of gradient-descent-type methods can be analyzed by leveraging our results. In (Lacotte & Pilanci, 2020), an analysis of the landscape for scalar-output networks based on the convex formulation was given, which establishes a direct mapping between the non-convex and convex objective landscapes. Finally, our copositive programming and semi-NMF representations of ReLU networks can be used to develop more interpretable neural models. An investigation of scalar-output convex neural models for neural image reconstruction was given in (Sahiner et al., 2020).

A APPENDIX

A.1 ADDITIONAL EXPERIMENTAL DETAILS

All neural networks in the experiments were trained using the PyTorch deep learning library (Paszke et al., 2019), using a single NVIDIA GeForce GTX 1080 Ti GPU. Algorithm 1 was run on a CPU with 256 GB of RAM, as was the maximum-margin matrix factorization. Unless otherwise stated, the Frank-Wolfe method from Algorithm 1 used a step size of α^(k) = 2/(2+k), and all methods were trained to minimize squared loss. Unless otherwise stated, the neural networks were trained until full training-loss convergence with SGD with a momentum parameter of 0.95 and a batch size the size of the training set (i.e. full-batch gradient descent), and the learning rate was decremented by a factor of 2 whenever the training loss reached a plateau. The initial learning rate was set as high as possible without causing the training to diverge. All neural networks were initialized with Kaiming uniform initialization (He et al., 2015).

A.1.1 ADDITIONAL EXPERIMENT: COPOSITIVE RELAXATION WHEN d > 4

The copositive relaxation of the neural network training problem described in (19) is not guaranteed to exactly correspond to the objective when d > 4. However, we find that in practice, this relaxation is tight even in such settings. To demonstrate such an instance, we consider the problem of generating images from noise. In particular, we initialize X element-wise from an i.i.d. standard Gaussian distribution. To analyze the spike-free setting, we whiten X using ZCA whitening. Then, we attempt to fit images Y from the MNIST handwritten digits dataset (LeCun et al., 1998) and the CIFAR-10 dataset (Krizhevsky et al., 2009), respectively. From each dataset, we select 100 random images with 10 samples from each class and flatten them into vectors, to form Y_MNIST ∈ R^{100×784} and Y_CIFAR ∈ R^{100×3072}. We allow the noise inputs to have the same shape as the output. Clearly, in these cases, with d = 784 and d = 3072 respectively, the copositive relaxation (20) is not guaranteed to correspond exactly to the neural network training optimum. However, we find across a variety of regularization parameters β that the solutions found by SGD and by this copositive relaxation exactly correspond, as demonstrated in Figure 4. While for the lowest value of β the copositive relaxation does not exactly correspond with the value obtained by SGD, we note that we showed the objective value of the copositive relaxation to be a lower bound on the neural network training objective, meaning that the differences seen in this plot are likely due to a numerical optimization issue rather than a fundamental one.
Non-convex SGD was trained for 60,000 epochs with 1000 neurons with a learning rate of 5 × 10 -5 , while the copositive relaxation was trained using Adam (Kingma & Ba, 2014) with the Geotorch library for constrained optimization and manifold optimization for deep learning in PyTorch, which allowed us to express the PSD constraint, with an additional hinge loss to penalize the violations of the affine constraints. This copositive relaxation was trained for 60,000 epochs with a learning rate of 10 -2 for CIFAR-10 and 4 × 10 -2 for MNIST, and β 1 = 0.9, β 2 = 0.999 and = 10 -8 as parameters for Adam. A . We note that for β ≥ 1.0, the optimal solution for both networks is to simply set all weights to zero. The best-case test loss for the ReLU network is nearly half the best-case test loss for the linear network. As discussed in Section 3.3., if the data matrix X is spike-free, the resulting convex ReLU model ( 15) is similar to a linear-activation network, with the only difference being an additional cone constraint on the weight matrix V . It stands to wonder whether in the case of spike-free data matrices, the use of a ReLU network is necessary at all, and whether a linear-activation network would perform equally well. In this experiment, we compare the performance of a ReLU-activation network to a linearactivation one, and demonstrate that even in the spike-free case, there exist instances in which the ReLU-activation network would be preferred. In particular, we take as our training data 3000 demeaned and ZCA-whitened images from the CIFAR-10 dataset to form our spike-free training data X ∈ R 3000×3072 . We then generate continuous labels Y ∈ R 3000×10 from a randomly-initialized ReLU two-layer network with 4000 hidden units. We use this same label-generating neural network to generate labels for images from the full 10,000-sample test set of CIFAR-10 as well, after the test images are pre-processed with the same whitening transformation used on the training data. 
Across different values of β, we measured the training and generalization performance of both ReLU-activation and linear-activation two-layer neural networks trained with SGD on this dataset. Both networks used 4000 hidden units, and were trained for 400 epochs with a learning rate of $10^{-2}$ and momentum of 0.95. Our results are displayed in Figure 5. As we can see, for all values of β, while the linear-activation network has equal or lesser training loss than the ReLU-activation network, the ReLU-activation network generalizes significantly better, achieving orders-of-magnitude better test loss. We note that for values of β = 1.0 and above, both networks learn the zero network (i.e., all weights at the optimum are zero), so their training and test losses are identical to each other. We also observe that the best-case test loss for the linear-activation network is achieved by simply learning the zero network, whereas for a value of β = $10^{-2}$ the ReLU-activation network can learn to generalize better than the zero network (achieving a test loss of 63,038, compared to a test loss of 125,383 for the zero network). These results demonstrate that even for spike-free data matrices, there are reasons to prefer a ReLU-activation network to a linear-activation network. In particular, because of the cone constraint on the dual weights $V$, the ReLU network is induced to learn a more complex representation than the linear network, which would explain its better generalization performance.

The CIFAR-10 dataset consists of 50,000 training images and 10,000 test images of size 32 × 32 with 3 RGB channels, over 10 classes (Krizhevsky et al., 2009). These images were normalized by the per-channel training-set mean and standard deviation. To form our training set, we selected 3,000 training images at random, with each class equally represented. This data was then feature-wise de-meaned and transformed using ZCA.
This same training-class mean and ZCA transformation was then also applied to the 10,000 testing points for evaluation.

A.1.3 DOES SGD ALWAYS FIND THE GLOBAL OPTIMUM FOR NEURAL NETWORKS?

For these experiments, SGD was trained with an initial learning rate of $4 \times 10^{-5}$ for 20,000 epochs. We used a regularization penalty value of $\beta = 10^{-2}$. The value of $t$ for Algorithm 1 was found by first starting at the value of the regularization penalty $\frac{1}{2}\sum_{j=1}^m \big(\|u_j\|_2^2 + \|v_j\|_2^2\big)$ from the solution found by SGD, then refining this value using manual tuning. A final value of $t = 1.495$ was chosen. For this experiment, there were $P = 50$ sign patterns. Algorithm 1 was run for 30,000 iterations, and took X seconds to solve.

A.1.4 MAXIMUM-MARGIN MATRIX FACTORIZATION

The CIFAR-10 and CIFAR-100 datasets consist of 50,000 training images and 10,000 test images of size 32 × 32 with 3 RGB channels, with 10 and 100 classes respectively (Krizhevsky et al., 2009). These images were normalized by the per-channel training-set mean and standard deviation. To form our training set, we selected 3,000 training images from these datasets at random, with each class equally represented. This data was then feature-wise de-meaned and transformed using ZCA. This same training-class mean and ZCA transformation was then also applied to the 10,000 testing points for evaluation. For CIFAR-10, we used a regularization parameter value of β = 1.0, whereas for CIFAR-100, we used a value of β = 5.0. SGD was trained for 400 epochs with a learning rate of $10^{-2}$ and 1000 neurons, with one-hot encoded labels and squared loss. Figure 6 displays the test accuracy of the learned networks. Surprisingly, the whitened classifier learned from only 3,000 images generalizes quite well in both circumstances, far exceeding the performance of the null classifier. For the CIFAR-10 experiments, the algorithm from Theorem 2 took only 0.018 seconds to solve, whereas for CIFAR-100 it took 0.36 seconds.

For this classification problem, we use one-hot encoded labels and squared loss. For β < 1, SGD used a learning rate of $10^{-3}$, and otherwise used a learning rate of $2 \times 10^{-3}$. SGD was trained for 8,000 epochs with 1000 neurons, while Algorithm 1 ran for 1,000 iterations. The copositive relaxation was optimized with CVXPY using a first-order solver on a CPU with 256 GB of RAM (Diamond & Boyd, 2016), with a maximum of 20,000 iterations. This dataset had $P = 114$ sign patterns. The value of $t$ for Algorithm 1 was chosen as the regularization penalty $\frac{\beta}{2}\sum_{j=1}^m \big(\|u_j\|_2^2 + \|v_j\|_2^2\big)$ of the solution found by SGD.

A.2 REDUCING THE DIMENSION OF THE TRAINING PROBLEM

Consider the neural network training problem
$$\min_{u_j, v_j} \frac{1}{2}\Big\|\sum_{j=1}^m (Xu_j)_+ v_j^\top - Y\Big\|_F^2 + \frac{\beta}{2}\sum_{j=1}^m \big(\|u_j\|_2^2 + \|v_j\|_2^2\big).$$
Let $X = UDV^\top$ be the compact SVD of $X$ with rank $r$, where $U \in \mathbb{R}^{n \times r}$, $D \in \mathbb{R}^{r \times r}$ and $V \in \mathbb{R}^{d \times r}$.
Let $\tilde{u}_j = V^\top u_j$ and $u_j^\perp = V_\perp^\top u_j$. We note that $Xu_j = XVV^\top u_j = XV\tilde{u}_j$ and
$$\|u_j\|_2^2 = u_j^\top \big(VV^\top + V_\perp V_\perp^\top\big) u_j = \|\tilde{u}_j\|_2^2 + \|u_j^\perp\|_2^2.$$
Then, we can re-parameterize the problem as
$$\min_{u_j^\perp, \tilde{u}_j, v_j} \frac{1}{2}\Big\|\sum_{j=1}^m (XV\tilde{u}_j)_+ v_j^\top - Y\Big\|_F^2 + \frac{\beta}{2}\sum_{j=1}^m \big(\|\tilde{u}_j\|_2^2 + \|u_j^\perp\|_2^2 + \|v_j\|_2^2\big).$$
We note that $u_j^\perp$ only appears in the regularization term; minimizing over $u_j^\perp$ thus means simply setting it to $0$. Then, we have
$$\min_{\tilde{u}_j, v_j} \frac{1}{2}\Big\|\sum_{j=1}^m (XV\tilde{u}_j)_+ v_j^\top - Y\Big\|_F^2 + \frac{\beta}{2}\sum_{j=1}^m \big(\|\tilde{u}_j\|_2^2 + \|v_j\|_2^2\big).$$
We note that $XV = UD \in \mathbb{R}^{n \times r}$ and $\tilde{u}_j \in \mathbb{R}^r$. Thus, for $X$ of rank $r$, we can effectively reduce the dimension of the neural network training problem without loss of generality. This therefore holds for all results concerning the complexity of the neural network training problem with data matrices of a fixed rank.
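As a numerical sanity check of this reduction (our own sketch, not from the paper): for any weight vector $u$, the pre-activation $Xu$ depends only on $V^\top u$, since $XV_\perp = 0$, so the network can equivalently be trained over $r$-dimensional weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 50, 20, 5

# Synthetic rank-r data matrix X.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))

# Compact SVD: V holds the r right singular vectors spanning the row space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:r].T                        # shape (d, r)

relu = lambda z: np.maximum(z, 0)

# Any u decomposes as V V^T u + V_perp V_perp^T u; the second part is
# annihilated by X, so (X u)_+ == ((X V) u_tilde)_+ with u_tilde = V^T u.
u = rng.standard_normal(d)
u_tilde = V.T @ u

assert np.allclose(relu(X @ u), relu((X @ V) @ u_tilde))
```

Since $XV = UD$ is only $n \times r$, the training problem can be posed directly over $\tilde{u}_j \in \mathbb{R}^r$.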

A.3.1 PROOF OF THEOREM 1

We begin with the primal problem (5), repeated here for convenience:
$$p^* = \min_{u_j \in \mathbb{R}^d,\ v_j \in \mathbb{R}^c} \frac{1}{2}\Big\|\sum_{j=1}^m (Xu_j)_+ v_j^\top - Y\Big\|_F^2 + \frac{\beta}{2}\sum_{j=1}^m \big(\|u_j\|_2^2 + \|v_j\|_2^2\big).$$
We start by re-scaling the weights in order to obtain a slightly different, equivalent objective, as has been done previously in (Pilanci & Ergen, 2020; Savarese et al., 2019).
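The re-scaling trick used in the next lemma boils down to the scalar identity $\min_{\gamma>0} \tfrac12(\gamma^2 a^2 + b^2/\gamma^2) = ab$, attained at $\gamma = \sqrt{b/a}$. A minimal numerical check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.standard_normal(8)   # stand-ins for one neuron's weights
v = rng.standard_normal(3)

a, b = np.linalg.norm(u), np.linalg.norm(v)

# Scan the balanced weight-decay penalty over re-scalings gamma.
gammas = np.linspace(0.05, 20.0, 200000)
penalty = 0.5 * (gammas ** 2 * a ** 2 + b ** 2 / gammas ** 2)

# The minimum over gamma equals ||u||_2 * ||v||_2, attained at sqrt(b/a).
assert np.isclose(penalty.min(), a * b, rtol=1e-6)
assert np.isclose(gammas[penalty.argmin()], np.sqrt(b / a), rtol=1e-3)
```

Applied per neuron, this is what turns the squared weight-decay penalty into the product $\|u_j\|_2\|v_j\|_2$ below.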

Lemma 4

The primal problem is equivalent to the following optimization problem:
$$p^* = \min_{\|u_j\|_2 \le 1} \min_{v_j \in \mathbb{R}^c} \frac{1}{2}\Big\|\sum_{j=1}^m (Xu_j)_+ v_j^\top - Y\Big\|_F^2 + \beta \sum_{j=1}^m \|v_j\|_2.$$
Proof: Note that for any $\gamma_j > 0$, we can re-scale the parameters as $\bar{u}_j = \gamma_j u_j$, $\bar{v}_j = v_j/\gamma_j$. Noting that the network output is unchanged by this re-scaling scheme, we have the equivalent problem
$$p^* = \min_{u_j \in \mathbb{R}^d,\ v_j \in \mathbb{R}^c} \min_{\gamma_j > 0} \frac{1}{2}\Big\|\sum_{j=1}^m (Xu_j)_+ v_j^\top - Y\Big\|_F^2 + \frac{\beta}{2}\sum_{j=1}^m \big(\gamma_j^2\|u_j\|_2^2 + \|v_j\|_2^2/\gamma_j^2\big).$$
Minimizing with respect to $\gamma_j$, we thus end up with
$$p^* = \min_{u_j \in \mathbb{R}^d,\ v_j \in \mathbb{R}^c} \frac{1}{2}\Big\|\sum_{j=1}^m (Xu_j)_+ v_j^\top - Y\Big\|_F^2 + \beta\sum_{j=1}^m \|u_j\|_2 \|v_j\|_2.$$
We can thus set $\|u_j\|_2 = 1$ without loss of generality. Further, relaxing this constraint to $\|u_j\|_2 \le 1$ does not change the optimal solution. In particular, for the problem
$$\min_{\|u_j\|_2 \le 1} \min_{v_j \in \mathbb{R}^c} \frac{1}{2}\Big\|\sum_{j=1}^m (Xu_j)_+ v_j^\top - Y\Big\|_F^2 + \beta \sum_{j=1}^m \|v_j\|_2 \qquad (28)$$
the constraint $\|u_j\|_2 = 1$ will be active for all non-zero $v_j$. Thus, relaxing the constraint will not change the objective. This proves the Lemma.

Now, we are ready to prove the first part of Theorem 1, i.e., the equivalence to the semi-infinite program (11).

Lemma 5 For all $\beta > 0$, the primal neural network training problem (25) has a strong dual, in the form of
$$p^* = d^* := \max_{Z:\ \|Z^\top (Xu)_+\|_2 \le \beta\ \forall\,\|u\|_2 \le 1} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2.$$
Proof: We first form the Lagrangian of the primal problem, by first re-parameterizing the problem as
$$\min_{\|u_j\|_2 \le 1} \min_{v_j, R} \frac{1}{2}\|R\|_F^2 + \beta\sum_{j=1}^m \|v_j\|_2 \quad \text{s.t.}\quad R = \sum_{j=1}^m (Xu_j)_+ v_j^\top - Y$$
and then forming the Lagrangian as
$$\min_{\|u_j\|_2 \le 1} \min_{v_j, R} \max_{Z}\ \frac{1}{2}\|R\|_F^2 + \beta\sum_{j=1}^m \|v_j\|_2 + \langle Z, Y\rangle + \langle Z, R\rangle - \sum_{j=1}^m \big\langle Z, (Xu_j)_+ v_j^\top\big\rangle.$$
By Sion's minimax theorem, we can switch the inner maximum and minimum, and minimize over $v_j$ and $R$. This produces the following problem:
$$p^* = \min_{\|u_j\|_2 \le 1} \max_{Z:\ \|Z^\top (Xu_j)_+\|_2 \le \beta} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2.$$
We then simply need to interchange the max and min to obtain the desired form. Note that this interchange does not change the objective value due to semi-infinite strong duality.
In particular, for any $\beta > 0$, this problem is strictly feasible (simply let $Z = 0$) and the objective value is bounded by $\frac{1}{2}\|Y\|_F^2$. Then, by Theorem 2.2 of (Shapiro, 2009), we know that strong duality holds, and
$$p^* = d^* := \max_{Z:\ \|Z^\top (Xu)_+\|_2 \le \beta\ \forall\,\|u\|_2 \le 1} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2$$
as desired. Furthermore, by (Shapiro, 2009), for a measure $\mu$, we obtain the following strong dual of the dual program (11):
$$d^* = \min_{\mu \ge 0} \max_{Z \in \mathbb{R}^{n \times c}} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2 - \int_{\mathcal{B}_2} \big(\|Z^\top (Xu)_+\|_2 - \beta\big)\, d\mu(u),$$
where $\mathcal{B}_2$ denotes the unit $\ell_2$-ball. By discretization arguments in Section 3 of (Shapiro, 2009), and by Helly's theorem, there exists some $m^* \le nc + 1$ such that this is equivalent to
$$d^* = \min_{\mu \ge 0,\ \|u_i\|_2 \le 1} \max_{Z \in \mathbb{R}^{n \times c}} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2 - \sum_{i=1}^{m^*} \big(\|Z^\top (Xu_i)_+\|_2 - \beta\big)\mu_i.$$
Maximizing with respect to $Z$ and expressing the norms via unit-norm vectors $g_i$, we obtain
$$\min_{\|u_i\|_2 \le 1} \min_{\mu \ge 0} \min_{\|g_i\|_2 \le 1} \frac{1}{2}\Big\|\sum_{i=1}^{m^*} \mu_i (Xu_i)_+ g_i^\top - Y\Big\|_F^2 + \beta\sum_{i=1}^{m^*} \mu_i,$$
which we can minimize with respect to $\mu$ to obtain the finite parameterization
$$d^* = \min_{\|u_i\|_2 \le 1} \min_{g_i} \frac{1}{2}\Big\|\sum_{i=1}^{m^*} (Xu_i)_+ g_i^\top - Y\Big\|_F^2 + \beta\sum_{i=1}^{m^*} \|g_i\|_2. \qquad (37)$$
This proves that the semi-infinite dual has finite support with at most $m^* \le nc + 1$ non-zero neurons. Thus, if the number of neurons of the primal problem satisfies $m \ge m^*$, strong duality holds.

Now, we seek to show the second part of Theorem 1, namely the equivalence to (12). Starting from (11), the dual constraint is given by
$$\max_{\|u\|_2 \le 1} \|Z^\top (Xu)_+\|_2 \le \beta.$$
Using the concept of the dual norm, we can introduce a variable $g$ to further re-express this constraint as
$$\max_{\|u\|_2 \le 1,\ \|g\|_2 \le 1} g^\top Z^\top (Xu)_+ \le \beta.$$
Then, enumerating over the sign patterns $\{D_i\}_{i=1}^P$, we have
$$\max_{i \in [P]} \max_{\substack{\|u\|_2 \le 1,\ \|g\|_2 \le 1 \\ (2D_i - I)Xu \ge 0}} g^\top Z^\top D_i X u \le \beta.$$
Now, we express this in terms of an inner product:
$$\max_{i \in [P]} \max_{\substack{(2D_i - I)Xu \ge 0 \\ \|u\|_2 \le 1,\ \|g\|_2 \le 1}} \big\langle Z, D_i X u g^\top\big\rangle \le \beta.$$
Letting $V = ug^\top$:
$$\max_{i \in [P]} \max_{\substack{V = ug^\top,\ (2D_i - I)Xu \ge 0 \\ \|V\|_* \le 1}} \big\langle Z, D_i X V\big\rangle \le \beta.$$
Now, we can take the convex hull of the constraint set, noting that since the objective is affine, this does not change the objective value.
$$\max_{i \in [P]} \max_{\substack{V \in \mathcal{K}_i \\ \|V\|_* \le 1}} \big\langle Z, D_i X V\big\rangle \le \beta.$$
Thus, the dual problem is given by
$$p^* = d^* = \max_{Z} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2 \quad \text{s.t.}\quad \max_{\substack{V_i \in \mathcal{K}_i \\ \|V_i\|_* \le 1}} \big\langle Z, D_i X V_i\big\rangle \le \beta \ \ \forall i \in [P].$$
We now form the Lagrangian,
$$p^* = d^* = \max_{Z} \min_{\lambda \ge 0} \min_{\substack{V_i \in \mathcal{K}_i\ \forall i \in [P] \\ \|V_i\|_* \le 1}} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2 + \sum_{i=1}^P \lambda_i\big(\beta - \langle Z, D_i X V_i\rangle\big).$$
We note that by Sion's minimax theorem, we can switch the max and min, and then maximize over $Z$. Following this, we obtain
$$p^* = d^* = \min_{\substack{V_i \in \mathcal{K}_i\ \forall i \in [P] \\ \|V_i\|_* \le 1}} \min_{\lambda \ge 0} \frac{1}{2}\Big\|\sum_{i=1}^P \lambda_i D_i X V_i - Y\Big\|_F^2 + \beta\sum_{i=1}^P \lambda_i.$$
This is simply by noting that left-multiplying by $X$ does not change the norm. Now, note that
$$\min_{V \in \mathrm{conv}\{ug^\top :\ Xu \ge 0,\ \|g\|_2 \le 1\}} \frac{1}{2}\|V - X^\top Y\|_F^2 + \beta\|V\|_* \ \ge\ \min_{V \in \mathrm{conv}\{ug^\top :\ \|g\|_2 \le 1\}} \frac{1}{2}\|V - X^\top Y\|_F^2 + \beta\|V\|_*.$$
The relaxed problem on the right-hand side is given exactly by maximum-margin matrix factorization (Srebro et al., 2005), i.e., it can be expressed as
$$\min_{V = \sum_{i=1}^c u_i g_i^\top} \frac{1}{2}\|V - X^\top Y\|_F^2 + \beta\|V\|_*.$$
Furthermore, this has a closed-form solution, given by soft-thresholding the singular values of $X^\top Y$ (Hastie et al., 2015). Thus, we have the solution of the relaxed problem as
$$V^* = \sum_{i=1}^c (\sigma_i - \beta)_+\, a_i b_i^\top.$$
Noting that $Xa_i \ge 0$ for all $i \in \{i : \sigma_i > \beta\}$ by assumption, this solution is feasible for the original problem. Thus, because the optimal value obtained by the solution to the relaxed problem was a lower bound for (15), it must be the optimal solution to (15).

Remark 5.2 It is interesting to note that when $Y$ is one-hot encoded,
$$X^\top Y = \begin{bmatrix} n^{(1)}\mu^{(1)} & n^{(2)}\mu^{(2)} & \cdots & n^{(c)}\mu^{(c)} \end{bmatrix},$$
where $n^{(i)}$ refers to the number of instances in class $i$, and $\mu^{(i)}$ refers to the mean of all data points belonging to class $i$. In this scenario, then, the optimal neural network weights $V^*$ are found via the maximum-margin factorization of this matrix. Further, if $Y^\top Y$ is element-wise non-negative, then by the Perron–Frobenius theorem, its maximal left singular vector is non-negative.
For such data matrices, if $\beta$ is chosen to be larger than the second largest singular value of $Y$, the solution in (16) is guaranteed to be exact.
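The closed-form solution via soft-thresholded SVD is simple to sketch (our own helper names; the check below only verifies convex optimality numerically against random perturbations):

```python
import numpy as np

def soft_thresholded_svd(M, beta):
    # Closed-form solution of min_V 0.5*||V - M||_F^2 + beta*||V||_*:
    # shrink each singular value of M by beta, flooring at zero.
    A, s, Bt = np.linalg.svd(M, full_matrices=False)
    return (A * np.maximum(s - beta, 0.0)) @ Bt

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 4))     # stand-in for X^T Y
beta = 0.5
V_star = soft_thresholded_svd(M, beta)

obj = lambda W: (0.5 * np.linalg.norm(W - M) ** 2
                 + beta * np.linalg.norm(W, 'nuc'))

# Global optimality of the convex problem: no perturbation improves it.
for _ in range(50):
    W = V_star + 1e-2 * rng.standard_normal(V_star.shape)
    assert obj(V_star) <= obj(W)
```

When the leading singular vectors additionally satisfy the cone constraint $Xa_i \ge 0$, this relaxed solution is exactly the solution of (15), as argued above.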

A.3.4 PROOF OF THEOREM 3

Start with (11):
$$p^* = d^* = \max_{Z} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2 \quad \text{s.t.}\quad \|Z^\top (Xu)_+\|_2 \le \beta \ \ \forall\,\|u\|_2 \le 1.$$
We can express this as
$$p^* = d^* = \max_{Z} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2 \quad \text{s.t.}\quad \max_{\|u\|_2 \le 1} \|Z^\top (Xu)_+\|_2^2 \le \beta^2.$$
Enumerating over all possible sign patterns $i \in [P]$, and noting that $\|Z^\top D_i X u\|_2^2 = \mathrm{tr}\big(Z^\top D_i X u u^\top X^\top D_i Z\big)$, we have
$$p^* = d^* = \max_{Z} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2 \quad \text{s.t.}\quad \max_{i \in [P]} \max_{\substack{(2D_i - I)Xu \ge 0 \\ \|u\|_2 \le 1}} \mathrm{tr}\big(Z^\top D_i X u u^\top X^\top D_i Z\big) \le \beta^2. \qquad (56)$$
Noting that the maximization constraint is linear in $uu^\top$, we can take the convex hull of the constraint set without changing the optimal value. Thus,
$$p^* = d^* = \max_{Z} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2 \quad \text{s.t.}\quad \max_{i \in [P]} \max_{\substack{U \in \mathcal{C}_i \\ \mathrm{tr}(U) \le 1}} \mathrm{tr}\big(Z^\top D_i X U X^\top D_i Z\big) \le \beta^2, \qquad (57)$$
which is then equivalent to
$$p^* = d^* = \max_{Z} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2 \quad \text{s.t.}\quad \max_{\substack{U_i \in \mathcal{C}_i \\ \mathrm{tr}(U_i) \le 1}} \mathrm{tr}\big(Z^\top D_i X U_i X^\top D_i Z\big) \le \beta^2 \ \ \forall i \in [P].$$
We thus have a problem with a convex objective and $P$ constraints, so we can take the Lagrangian:
$$p^* = d^* = \max_{Z} \min_{\substack{U_i \in \mathcal{C}_i\ \forall i \in [P] \\ \mathrm{tr}(U_i) \le 1,\ \lambda \ge 0}} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2 + \sum_{i=1}^P \lambda_i\Big(\beta^2 - \mathrm{tr}\big(Z^\top D_i X U_i X^\top D_i Z\big)\Big).$$
Noting that this function is concave over $Z$, affine over $\lambda$ and $U_i$, and that the constraint set is convex, by Sion's minimax theorem we can exchange the max and min without changing the objective. Thus, we can re-write this problem as
$$p^* = d^* = \min_{\substack{U_i \in \mathcal{C}_i\ \forall i \in [P] \\ \mathrm{tr}(U_i) \le 1,\ \lambda \ge 0}} \max_{Z} -\frac{1}{2}\|Z - Y\|_F^2 + \frac{1}{2}\|Y\|_F^2 + \sum_{i=1}^P \lambda_i\Big(\beta^2 - \mathrm{tr}\big(Z^\top D_i X U_i X^\top D_i Z\big)\Big).$$

In general, the Frank-Wolfe algorithm (Frank et al., 1956) can be applied to the resulting problem, as described below. Note that the objective value in the optimization problem of the LMO step is affine with respect to the variables $S_i$. Thus, the optimal value occurs at an extreme point of the feasible set: only one of $\{M_i\}_{i=1}^P$ is active at the optimum, at index $i = i^*$, and moreover this $M_{i^*}$ occurs at an extreme point of $\mathcal{K}_{i^*}$, i.e., $M_{i^*} \in \tilde{\mathcal{K}}_{i^*}$. All other $M_i$ with $i \ne i^*$ are zero at the optimum. The LMO step is thus
$$M_{i^*} = \arg\min_{\substack{i \in [P] \\ S \in \tilde{\mathcal{K}}_i,\ \|S\|_* \le t}} \Big\langle D_i X S,\ \sum_{j=1}^P D_j X V_j^{(k)} - Y\Big\rangle.$$
Recalling that $\mathcal{K}_i := \mathrm{conv}\{ug^\top : (2D_i - I)Xu \ge 0,\ \|g\|_2 \le 1\}$, the boundary of $\mathcal{K}_i$ is simple to express, leaving us with $M_{i^*} = u_{i^*} g_{i^*}^\top$, where
$$(u_{i^*}, g_{i^*}) = \arg\min_{\substack{i \in [P] \\ (2D_i - I)Xu \ge 0 \\ \|g\|_2 \le 1,\ \|u\|_2 \le t}} \Big\langle D_i X u g^\top,\ \sum_{j=1}^P D_j X V_j^{(k)} - Y\Big\rangle.$$
Note that we can change the constraint $\|u\|_2 \le t$ to $\|u\|_2 \le 1$ and simply multiply the solution by a factor of $t$ afterwards. We can thus write the key Frank-Wolfe LMO subproblem as a maximization for each $i$, storing the arg maxes $(u_i, g_i)$ for each $i$. Then, from the index which attains the maximum, $i^* := \arg\max_i s_i^{(k)}$, we form $M_{i^*}^{(k)} = u_{i^*} g_{i^*}^\top$; for all other $i \ne i^*$, as stated previously, $M_i^{(k)} = 0$. Then, we re-multiply by the factor $t$ which was removed in (71) to obtain the update rule for all $i \in [P]$:
$$V_i^{(k+1)} = (1 - \alpha^{(k)})\, V_i^{(k)} + t\,\alpha^{(k)} M_i^{(k)}.$$



Figure 1: As the number of neurons increases, the solution of SGD approaches the optimal value.

Figure 2: The maximum-margin SVD from Theorem 2 provides the closed-form solution for the optimal value of the neural network training problem for whitened CIFAR-10 and CIFAR-100.

Figure 3: Spiral classification: SGD (1000 neurons), Algorithm 1, and copositive relaxation (19).

Figure 4: The copositive relaxation (20) and solution found by SGD nearly correspond for almost all β.

Figure 5: Comparing train and test loss of linear and ReLU two-layer networks on generated continuous labels on CIFAR-10. We note that for β ≥ 1.0, the optimal solution for both networks is to simply set all weights to zero. The best-case test loss for the ReLU network is nearly half the best-case test loss for the linear network.

Figure 6: Test accuracy for maximum-margin matrix factorization and SGD for whitened CIFAR-10 and CIFAR-100.

Solving for $Z$ and re-substituting, and then applying the same re-scaling as in the proof of Theorem 1, we obtain the desired objective as a minimization over $U_i \in \mathcal{C}_i$ with $\mathrm{tr}(U_i) \le 1$ for all $i \in [P]$.

A.4 ON THE FRANK-WOLFE METHOD APPLIED TO (12)

A.4.1 DERIVATION OF THE FRANK-WOLFE ALGORITHM

The Frank-Wolfe algorithm (Frank et al., 1956) aims to solve the problem
$$\min_{x} f(x) \quad \text{s.t.}\quad x \in \mathcal{D} \qquad (63)$$
for some convex function $f$ and convex set $\mathcal{D}$. It does so via the following iterative steps for $k = 1, \dots, K$, with step sizes $\alpha^{(k)}$ and an initial point $x^{(1)}$:
1. (LMO step) Solve the problem
$$m^{(k)} := \arg\min_{s \in \mathcal{D}} \big\langle s, \nabla_x f(x^{(k)})\big\rangle. \qquad (64)$$
2. (Update step) Update the decision variable
$$x^{(k+1)} = (1 - \alpha^{(k)})\, x^{(k)} + \alpha^{(k)} m^{(k)}. \qquad (65)$$
We will now apply the Frank-Wolfe method to the inner minimization of (14), i.e., for a fixed value of $t$, solving the problem over $V_i \in \mathcal{K}_i$ for all $i \in [P]$.
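The two steps above can be sketched on a toy instance, minimizing a quadratic over the $\ell_1$ ball, chosen only because its LMO has a one-line solution (this is an illustration of (63)–(65), not the paper's Algorithm 1):

```python
import numpy as np

# min_x 0.5*||x - y||^2  s.t.  ||x||_1 <= 1
y = np.array([2.0, -0.5, 0.3])
x = np.zeros_like(y)                    # feasible initial point x^(1)

for k in range(1, 2001):
    grad = x - y                        # gradient of f at x^(k)
    # LMO step: argmin_{||s||_1 <= 1} <s, grad> is a signed coordinate vector.
    i = np.argmax(np.abs(grad))
    m = np.zeros_like(y)
    m[i] = -np.sign(grad[i])
    # Update step with the standard step size 2/(k+2).
    alpha = 2.0 / (k + 2)
    x = (1 - alpha) * x + alpha * m

# The constraint is active at the optimum, whose value here is [1, 0, 0].
assert np.allclose(x, [1.0, 0.0, 0.0], atol=1e-2)
```

In the paper's setting, the LMO is over the sets $\mathcal{K}_i$ instead of the $\ell_1$ ball, which is what makes the extreme-point (rank-one) structure of the iterates appear.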

For each $i \in [P]$, we thus solve the problem
$$s_i^{(k)} := \max_{\substack{(2D_i - I)Xu \ge 0 \\ \|u\|_2 \le 1,\ \|g\|_2 \le 1}} \Big\langle D_i X u g^\top,\ Y - \sum_{j=1}^P D_j X V_j^{(k)} \Big\rangle.$$

A.4.2 SOLVING EACH FRANK-WOLFE ITERATE

We discuss here how to solve the subproblem for $s_i^{(k)}$.

Table 1 summarizes the complexities of the neural network training problem: the best known upper bounds for global optimization of two-layer ReLU networks with scalar and vector outputs, where n is the number of samples, d is the dimension of the samples, and r is the rank of the training data. Note that for convolutional networks, r is the size of a single filter, e.g., a convolutional layer with a kernel size of 3 × 3 corresponds to r = 9.

ACKNOWLEDGEMENTS

This work was partially supported by the National Science Foundation under grants IIS-1838179 and ECCS-2037304, the National Institutes of Health under grants R01EB009690 and R01EB0026136, Facebook Research, Adobe Research and Stanford SystemX Alliance.


And now minimize over $\lambda$ to obtain the desired result.

Remark 5.1 Given the optimal solution of (12), it is natural to wonder how these dual variables relate to the optimal neurons of the neural network training problem (5). Given an optimal $\{V_i^*\}_{i=1}^P$, we simply need to factor them into rank-one terms $h_{ij}^* g_{ij}^{*\top}$, where $(2D_i - I)Xh_{ij}^* \ge 0$. This is similar in flavor to semi-NMF. Since $V_i^*$ is in the cone $\mathcal{K}_i$, exact semi-NMF factorization can be performed in polynomial time (Gillis & Kumar, 2015). Once this factorization is obtained, assuming without loss of generality that $\|g_{ij}^*\|_2 = 1$, the optimal neurons are given by these factors. Thus, given a solution to (12), a polynomial-time algorithm exists for reconstructing the weights of the original neural network training problem.

A.3.2 A COROLLARY FOR CNNS WITH AVERAGE POOLING

We first introduce additional notation for CNNs with average pooling. In particular, following Ergen & Pilanci (2020d), we use the set of patch matrices $\{X_k\}_{k=1}^K$ to define a new data matrix $M$, where $h$ is the convolutional filter size. This matrix has a corresponding set of sign patterns, defined analogously, where $r_c = \mathrm{rank}(M) \le h$. We can then enumerate the set of patch matrices.

Corollary 5.1 In the case of a two-layer CNN with global average pooling as in (6), the neural network training problem admits a strong dual over convex sets $\mathcal{K}_i$. This strong dual is convex. For a fixed kernel size, $K$ and $r_c$ are fixed, and the problem is polynomial in $n$, i.e., of complexity $O(n^{r_c})$. Since in practice $M$ is a tall matrix with relatively few columns, it is almost always full-rank, in which case the computational complexity of solving the strong dual is $O(n^h)$. This problem can be solved with the same Frank-Wolfe algorithm as presented in Algorithm 1.
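The patch-matrix view can be illustrated for a 1-D convolution (our own toy construction; the paper's exact $M$ is not reproduced here): each $X_k$ collects the $k$-th length-$h$ window of every sample, so stacking all patches yields a matrix with only $h$ columns, making the rank bound $r_c \le h$ immediate.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, h = 10, 8, 3                    # samples, signal length, filter size
X = rng.standard_normal((n, d))

# Patch matrices for a stride-1, valid 1-D convolution:
# X_k holds the k-th window of length h from each of the n samples.
K = d - h + 1
patches = [X[:, k:k + h] for k in range(K)]

# Stacking all patches gives an (n*K) x h matrix, so its rank is at most h.
M = np.vstack(patches)
assert M.shape == (n * K, h)
assert np.linalg.matrix_rank(M) <= h
```

This is why the complexity for convolutional architectures depends only on the filter size $h$ rather than on the ambient dimension $d$.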

A.3.3 PROOF OF THEOREM 2

We note that whitened data matrices with $n \le d$ satisfy $XX^\top = I$. Then, we can solve the modified problem using cone-constrained PCA. In particular, note that for a fixed $u$, we can solve for $g$ in closed form. Re-substituting this solution back into the objective, and squaring the objective (without loss of generality, since it takes only non-negative values), the LMO step of the Frank-Wolfe algorithm becomes identical to a cone-constrained PCA problem. This problem, via Lemma 7 of Asteris et al. (2014), can be solved in $O(n^d)$ time. However, we note that because we can reduce the dimension of the problem to $r$ without loss of generality (see Appendix A.2), this simplifies to $O(n^r)$. We perform $P$ such maximization problems per iteration of Frank-Wolfe.
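The "solve for $g$ in closed form" step is just a dual-norm maximization: for fixed $u$, maximizing $g^\top Z^\top (Xu)_+$ over $\|g\|_2 \le 1$ is solved by normalization. A sketch with synthetic data (all names ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, c = 30, 6, 4
X = rng.standard_normal((n, d))
Z = rng.standard_normal((n, c))     # stand-in for the dual variable
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

# For fixed u, max_{||g||_2 <= 1} g^T Z^T (Xu)_+ is attained at w / ||w||_2.
w = Z.T @ np.maximum(X @ u, 0.0)
g_star = w / np.linalg.norm(w)

# No other unit vector does better (Cauchy-Schwarz).
for _ in range(200):
    g = rng.standard_normal(c)
    g /= np.linalg.norm(g)
    assert g @ w <= g_star @ w + 1e-12
```

Substituting $g^* = w/\|w\|_2$ back leaves a maximization over $u$ alone, which is the cone-constrained PCA problem referenced above.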

A.5 A CUTTING PLANE METHOD FOR SOLVING (17)

Consider the dual problem (73). Enumerating over sets of hyperplanes, we can express the dual constraint as in (75), where for each $i$ this is a cone-constrained PCA problem in dimension $d$ with $n$ linear inequality constraints. Using Lemma 7 of (Asteris et al., 2014), there exists an algorithm which solves the cone-constrained PCA problem in $O(n^r)$ time. Thus, solving the optimization problem in the dual constraint (75) takes $O(Pn^r)$ time. For a fixed rank $r$, $P$ is of order $O(n^r)$, but otherwise $P$ is $O(n^d)$ as well. Thus, determining the feasibility of a dual variable $Z$ from (75) has complexity $O(n^d)$ in the general case, and $O(n^r)$ in the rank-$r$ case.

Now, using the Analytic Center Cutting Plane Method (ACCPM), the cone-constrained PCA procedure above provides an oracle for the feasibility of a dual variable $Z$. Using this oracle, ACCPM solves the convex problem (73) in a number of steps polynomial in the dimension of $Z$, i.e., $\mathrm{poly}(nc)$ steps. Thus, the overall complexity of the cutting-plane method is $O(n^d\,\mathrm{poly}(nc))$ in the general case, and $O(n^r\,\mathrm{poly}(nc))$ in the rank-$r$ case, which is polynomial for a fixed $r$.

We note that while this is similar in complexity to the results of (Pilanci & Ergen, 2020), there are a few subtle differences. In particular, there are additional terms polynomial in $c$ which do not appear in their work. However, the broader picture of the analyses aligns: for full-rank matrices, solving the convex dual of the neural network training problem is NP-hard, while for matrices of a fixed rank, a polynomial-time algorithm exists.

Published as a conference paper at ICLR 2021

A.6 EXTENSIONS TO GENERAL CONVEX LOSS FUNCTIONS

This follows closely from a similar discussion by Pilanci & Ergen (2020). Consider the objective of the neural network training problem with a general convex loss function $\ell$. Then, we can follow the same proof of Theorem 1 exactly, instead substituting the Fenchel conjugate of $\ell$, defined as
$$\ell^*(Z) := \sup_{\hat{Y}}\ \langle Z, \hat{Y}\rangle - \ell(\hat{Y}).$$
Then, we have the dual objective analog of (11). We can further elucidate the convex constraint on the dual variable $Z$ as done in the proof of Theorem 1. Then, we can follow the same steps from this previous proof, noting that by the Fenchel–Moreau theorem, $\ell^{**} = \ell$ (Borwein & Lewis, 2010). Thus, we obtain the finite-dimensional convex strong dual, as desired. We note that the Frank-Wolfe method in Algorithm 1 holds for general convex $\ell$ as well, with a small modification corresponding to the gradient of the loss function.
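As a concrete worked instance (our own check, consistent with the derivation above): for the squared loss used throughout the paper, the Fenchel conjugate recovers exactly the quadratic dual objective of (11):

```latex
\ell(\hat{Y}) = \tfrac{1}{2}\|\hat{Y} - Y\|_F^2
\quad\Longrightarrow\quad
\ell^*(Z) = \sup_{\hat{Y}}\ \langle Z, \hat{Y}\rangle - \tfrac{1}{2}\|\hat{Y} - Y\|_F^2
          = \tfrac{1}{2}\|Z\|_F^2 + \langle Z, Y\rangle,
```

so that $-\ell^*(-Z) = \langle Z, Y\rangle - \tfrac{1}{2}\|Z\|_F^2 = -\tfrac{1}{2}\|Z - Y\|_F^2 + \tfrac{1}{2}\|Y\|_F^2$, matching the dual objective of (11) term by term.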

