VECTOR-OUTPUT RELU NEURAL NETWORK PROBLEMS ARE COPOSITIVE PROGRAMS: CONVEX ANALYSIS OF TWO LAYER NETWORKS AND POLYNOMIAL-TIME ALGORITHMS

Abstract

We describe the convex semi-infinite dual of the two-layer vector-output ReLU neural network training problem. This semi-infinite dual admits a finite-dimensional representation, but its support is over a convex set which is difficult to characterize. In particular, we demonstrate that the non-convex neural network training problem is equivalent to a finite-dimensional convex copositive program. Our work is the first to identify this strong connection between the global optima of neural networks and those of copositive programs. We thus demonstrate how neural networks implicitly attempt to solve copositive programs via semi-nonnegative matrix factorization, and draw key insights from this formulation. We describe the first algorithms for provably finding the global minimum of the vector-output neural network training problem, which are polynomial in the number of samples for fixed data rank, yet exponential in the dimension. However, in the case of convolutional architectures, the computational complexity is exponential only in the filter size and polynomial in all other parameters. We describe the circumstances in which the global optimum of this neural network training problem can be found exactly with soft-thresholded SVD, and we provide a copositive relaxation which is guaranteed to be exact for certain classes of problems and which corresponds in practice to the solution found by Stochastic Gradient Descent.

1. INTRODUCTION

In this paper, we analyze vector-output two-layer ReLU neural networks from an optimization perspective. These networks, while simple, are the building blocks of deep networks which have been found to perform tremendously well for a variety of tasks. We find that vector-output networks regularized with standard weight-decay have a convex semi-infinite strong dual: a convex program with infinitely many constraints. However, this strong dual has a finite parameterization, though expressing this parameterization is non-trivial. In particular, we find that expressing a vector-output neural network as a convex program requires taking the convex hull of completely positive matrices. Thus, we find an intimate, novel connection between neural network training and copositive programs, i.e. programs over the set of completely positive matrices (Anjos & Lasserre, 2011). We describe algorithms which can be used to find the global minimum of the neural network training problem in polynomial time for data matrices of fixed rank, a setting which holds for convolutional architectures. We also demonstrate that, under certain conditions, we can provably find the optimal solution to the neural network training problem using soft-thresholded Singular Value Decomposition (SVD). In the general case, we introduce a relaxation to parameterize the neural network training problem, which we find in practice to be tight in many circumstances.
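To make the soft-thresholded SVD operation referenced above concrete, the following is a minimal NumPy sketch of singular-value soft-thresholding. The particular matrix to which it is applied and the conditions under which it recovers the global optimum are developed later in the paper, so the generic matrix A and threshold beta below are purely illustrative.

```python
import numpy as np

def soft_thresholded_svd(A, beta):
    """Singular-value soft-thresholding: shrink each singular value of A by beta.

    Illustrative sketch only; the paper applies this operation to a specific
    matrix under conditions described later, not to an arbitrary matrix A.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - beta, 0.0)   # soft-threshold the spectrum
    return U @ np.diag(s_shrunk) @ Vt

# Illustrative usage on a random matrix; A and beta are placeholders.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
A_st = soft_thresholded_svd(A, beta=0.1)
```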

1.1. RELATED WORK

Our analysis focuses on the optima of finite-width neural networks. This approach contrasts with approaches which analyze infinite-width neural networks, such as the Neural Tangent Kernel (Jacot et al., 2018). Despite advancements in this direction, infinite-width neural networks do not exactly correspond to their finite-width counterparts, and thus this method of analysis is insufficient for fully explaining their success (Arora et al., 2019). Other works optimize neural networks under assumptions on the data distribution. Of particular interest is Ge et al. (2018), which demonstrates that a polynomial number of samples generated from a planted neural network model is sufficient for extracting its parameters using tensor methods, assuming the inputs are drawn from a symmetric distribution. If the input distribution to a simple convolutional neural network with one filter is Gaussian, it has also been shown that gradient descent can find the global optimum in polynomial time (Brutzkus & Globerson, 2017). In contrast to these works, we seek general principles for learning two-layer ReLU networks, regardless of the data distribution and without planted-model assumptions. Another line of work aims to understand the success of neural networks via implicit regularization, which analyzes how models trained with Stochastic Gradient Descent (SGD) find solutions which generalize well, even without explicit control of the optimization objective (Gunasekar et al., 2017; Neyshabur et al., 2014). In contrast, we consider the setting of explicit regularization, which is often used in practice in the form of weight decay: the sum of squared norms of the network weights is penalized with a single regularization parameter $\beta$, which can be critical for neural network performance (Golatkar et al., 2019).

Our approach of analyzing finite-width neural networks with a fixed training dataset has been explored for networks with a scalar output (Pilanci & Ergen, 2020; Ergen & Pilanci, 2020a;d). In fact, our work here can be considered a generalization of these results. We consider a ReLU-activation two-layer network $f: \mathbb{R}^d \to \mathbb{R}^c$ with $m$ neurons:

$$f(x) = \sum_{j=1}^{m} (x^\top u_j)_+ v_j, \qquad (1)$$

where $(\cdot)_+ = \max(0, \cdot)$ denotes the ReLU activation, $\{u_j \in \mathbb{R}^d\}_{j=1}^m$ are the first-layer weights of the network, and $\{v_j \in \mathbb{R}^c\}_{j=1}^m$ are the second-layer weights. In the scalar-output case, the weights $v_j$ are scalars, i.e. $c = 1$. Pilanci & Ergen (2020) find that the neural network training problem in this setting corresponds to a finite-dimensional convex program. However, the setting of scalar-output networks is limited. In particular, it cannot account for tasks such as multi-class classification or multi-dimensional regression, which are some of the most common uses of neural networks. In contrast, the vector-output setting is quite general, and even greedily training and stacking such shallow vector-output networks can match or even exceed the performance of deeper networks on large datasets for classification tasks (Belilovsky et al., 2019). We find that extending the scalar case to the vector-output case is an exceedingly non-trivial task, one which generates novel insights. Thus, generalizing the results of Pilanci & Ergen (2020) is an important step toward a more complete understanding of the behavior of neural networks in practice.
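As a concrete reference for the architecture in (1), the following is a minimal NumPy sketch of the vector-output forward pass together with a weight-decay-regularized training objective. The squared loss and the variable names (X, Y, U, V) are illustrative assumptions; the section above only specifies the network form and the weight-decay regularizer with parameter beta.

```python
import numpy as np

def relu_network(X, U, V):
    """Two-layer vector-output ReLU network from (1).

    X: (n, d) data matrix, U: (d, m) first-layer weights, V: (m, c) second-layer
    weights (row j of V is v_j). Returns the (n, c) matrix of outputs sum_j (X u_j)_+ v_j.
    """
    return np.maximum(X @ U, 0.0) @ V

def weight_decay_objective(X, Y, U, V, beta):
    """Squared loss plus weight decay (sum of squared weight norms scaled by beta).

    The squared loss is an illustrative choice, not necessarily the paper's exact loss.
    """
    residual = relu_network(X, U, V) - Y
    reg = np.sum(U ** 2) + np.sum(V ** 2)
    return 0.5 * np.sum(residual ** 2) + 0.5 * beta * reg

# Example: n=20 samples, d=5 features, m=8 neurons, c=3 outputs
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((20, 5)), rng.standard_normal((20, 3))
U, V = rng.standard_normal((5, 8)), rng.standard_normal((8, 3))
print(weight_decay_objective(X, Y, U, V, beta=1e-3))
```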
Certain works have also considered technical problems which arise in our analysis, though in application they are entirely different. Among these is the analysis of cone-constrained PCA, as explored by Deshpande et al. (2014) and Asteris et al. (2014). They consider the following optimization problem,

$$\max_{u \in \mathbb{R}^d} \; u^\top R u \quad \text{s.t.} \quad \|u\|_2 = 1, \; Xu \geq 0, \qquad (2)$$

which is in general considered NP-hard. Asteris et al. (2014) provide an exponential-time algorithm which runs in $\mathcal{O}(n^d)$ time to find the exact solution to (2), where $X \in \mathbb{R}^{n \times d}$ and $R \in \mathbb{S}^d$ is a symmetric matrix. We leverage this result to show that the optimal value of the vector-output neural network training problem can be found in the worst case in exponential time with respect to $r := \mathrm{rank}(X)$, while in the case of a fixed-rank data matrix our algorithm runs in polynomial time. In particular, convolutional networks with fixed filter sizes (e.g., $3 \times 3 \times m$ convolutional kernels) correspond to the fixed-rank data case (e.g., $r = 9$).
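To make problem (2) concrete, the following is a naive NumPy sketch that grid-searches the unit circle in d = 2 for the best feasible direction. It is only an illustrative brute-force baseline for tiny dimension, not the exact O(n^d) hyperplane-arrangement algorithm of Asteris et al. (2014).

```python
import numpy as np

def cone_constrained_pca_grid(X, R, n_grid=100_000):
    """Grid-search baseline for problem (2) in d = 2:
        max_u  u^T R u   s.t.  ||u||_2 = 1,  X u >= 0.

    Illustrative only; the exact algorithm enumerates hyperplane-arrangement
    regions rather than grid points.
    """
    thetas = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
    U = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # unit vectors in R^2
    feasible = np.all(U @ X.T >= 0.0, axis=1)                 # cone constraint X u >= 0
    if not feasible.any():
        return None, -np.inf
    Uf = U[feasible]
    vals = np.einsum("ij,jk,ik->i", Uf, R, Uf)                 # u^T R u per candidate
    best = np.argmax(vals)
    return Uf[best], vals[best]

# Example with n = 4 constraints in d = 2 and a random symmetric R
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))
A = rng.standard_normal((2, 2)); R = (A + A.T) / 2
u_best, val_best = cone_constrained_pca_grid(X, R)
```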



In search of a polynomial-time approximation

