SCALING CONVEX NEURAL NETWORKS WITH BURER-MONTEIRO FACTORIZATION

Abstract

Recently, it has been demonstrated that the training problem for a wide variety of (non-)linear two-layer neural networks (such as two-layer perceptrons, convolutional networks, and self-attention) can be posed as an equivalent convex optimization problem with an induced regularizer that encourages low rank. However, this regularizer becomes prohibitively expensive to compute even at moderate scales, which impedes the training of convex neural networks. To this end, we propose applying the Burer-Monteiro factorization to convex neural networks, which for the first time enables a Burer-Monteiro perspective on neural networks with non-linearities. This factorization leads to an equivalent yet computationally tractable non-convex alternative with no spurious local minima. We develop a novel relative optimality bound on stationary points of the Burer-Monteiro factorization, thereby providing verifiable conditions under which any stationary point is a global optimum. Furthermore, we show for the first time that linear self-attention with sufficiently many heads has no spurious local minima. Our experiments demonstrate the practical implications of the relative optimality bound for stationary points of the Burer-Monteiro factorization.

1. INTRODUCTION

It has been demonstrated that the training problem for (non-linear) two-layer neural networks is equivalent to a convex program (Pilanci & Ergen, 2020; Ergen & Pilanci, 2020; Sahiner et al., 2021b; Ergen et al., 2021; Sahiner et al., 2021a). This has been observed for a variety of architectures, including multi-layer perceptrons (MLPs) (Pilanci & Ergen, 2020; Sahiner et al., 2021b), convolutional neural networks (CNNs) (Ergen & Pilanci, 2020; Sahiner et al., 2021c), and self-attention-based transformers (Sahiner et al., 2022). A major benefit of convex training of neural networks is that global optimality is guaranteed, which brings transparency to training. The convex formulation induces biases through regularization of the network weights. For linear activation, the convex model directly imposes nuclear-norm regularization, which is well-known to encourage low-rank solutions (Recht et al., 2010). For ReLU activation, however, the convex model induces a type of nuclear norm that promotes sparse factorization while the left factor is constrained to an affine space (Sahiner et al., 2021b). This constrained nuclear norm is NP-hard to compute, which impedes the utility of convex neural networks with ReLU activation. To address this computational challenge, we seek a method that (i) inherits the per-iteration complexity of non-convex neural network training, and (ii) inherits the optimality guarantees and transparency of convex training. To this end, we leverage the well-studied Burer-Monteiro (BM) factorization (Burer & Monteiro, 2003), which was originally proposed as a heuristic to improve the complexity of solving convex semi-definite programs (SDPs).
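As a brief illustration of the mechanism at play (a sketch for intuition, not the paper's algorithm): the BM approach to nuclear-norm problems rests on the variational identity ||W||_* = min over factorizations W = UV^T of (||U||_F^2 + ||V||_F^2)/2, so a nuclear-norm-regularized convex program can be traded for a factored non-convex one with cheaper iterations. The snippet below checks this identity numerically at the balanced SVD factorization, where the minimum is attained; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))  # rank <= 4

# Nuclear norm computed directly as the sum of singular values.
nuc = np.linalg.svd(W, compute_uv=False).sum()

# Balanced factors U = P diag(sqrt(S)), V = Q diag(sqrt(S)) attain the minimum
# of (||U||_F^2 + ||V||_F^2)/2 over all factorizations W = U V^T.
P, S, Qt = np.linalg.svd(W, full_matrices=False)
U = P * np.sqrt(S)   # scale column i of P by sqrt(S_i)
V = Qt.T * np.sqrt(S)
assert np.allclose(U @ V.T, W)  # still a valid factorization of W

bm_value = 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)
print(np.isclose(nuc, bm_value))  # True: the factored objective matches ||W||_*
```

The factored problem is non-convex in (U, V), which is exactly why optimality guarantees for its stationary points, of the kind developed in this paper, are needed.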
BM has been applied as an efficient solution strategy for problems ranging from matrix factorization (Zheng & Lafferty, 2016; Park et al., 2017; Ge et al., 2017; Gillis, 2017) to rank minimization (Mardani et al., 2013; Recht et al., 2010; Wang et al., 2017) and matrix completion (Mardani et al., 2015; Ge et al., 2017). BM has also been used for simplified neural network models (Kawaguchi, 2016; Haeffele & Vidal, 2017; Du & Lee, 2018), where optimality conditions for local minima are provided. However, no work has deployed the BM factorization for practical non-linear neural networks, and no guarantees are available on the optimality of stationary points. This is likely because BM theory is not applicable to standard non-convex ReLU networks due to the non-linearity between layer weights. Thus, our focus in this work is to adapt BM to practical two-layer (non-linear) convex neural networks. We consider three common architectures, namely MLPs, CNNs, and self-attention networks. For these scenarios, we develop verifiable relative optimality bounds for all local minima and stationary points that are easy to check and interpret. In light of these conditions, we identify useful insights into the properties of neural networks that contribute to optimality. In particular, we observe that for self-attention networks, all local minima coincide with global optima if there are sufficiently many heads. The optimality guarantees also provide useful algorithmic insights, allowing one to verify whether light-weight first-order methods such as SGD achieve the global optimum of the non-convex training problem. Our experiments on image classification indicate that the BM factorization enables layerwise training of convex CNNs, which allows convex networks, for the first time, to match the performance of multi-layer end-to-end trained non-convex CNNs.

1.1. CONTRIBUTIONS

All in all, our contributions are summarized as follows:

• We propose the BM factorization for efficiently solving convex neural networks with ReLU activation at moderate and large scales. This is the first time BM theory has been applied to the non-linear neural network setting.

• We derive a novel bound on the relative optimality of stationary points of the BM factorization for neural networks.

• Accordingly, we identify simple and verifiable conditions which guarantee that a stationary point of the non-convex BM formulation achieves the global optimum of the convex neural network.

• We derive basic insights into the fundamental properties of neural networks that contribute to optimality, e.g., that linear self-attention has no spurious local minima if it has sufficiently many heads.

• Our experiments verify the proposed relative optimality bound on stationary points of the BM factorization, and uncover cases where SGD finds saddle points, even in two-layer neural networks.

1.2. RELATED WORK

Burer-Monteiro factorization. The Burer-Monteiro (BM) factorization was first introduced in (Burer & Monteiro, 2003; 2005). There has been a long line of work studying the use of this factorization for solving SDPs (Boumal et al., 2016; Cifuentes & Moitra, 2019; Waldspurger & Waters, 2020; Erdogdu et al., 2021). In the rectangular-matrix case, gradient descent converges to a global optimum of the matrix factorization problem with high probability for certain classes of matrices (Zheng & Lafferty, 2016). The BM factorization has also been studied in the rectangular case in more generic settings (Bach et al., 2008; Haeffele et al., 2014; Haeffele & Vidal, 2017).

Nuclear norm and rank minimization. The ability of nuclear-norm regularization to induce low rank has been studied extensively in compressed sensing (Candès & Recht, 2009; Recht et al., 2010; Candès & Tao, 2010). The BM factorization has been applied to scale up nuclear-norm minimization (Mardani et al., 2015; 2013). It has also been deployed for low-rank matrix factorization (Cabral et al., 2013; Zhu et al., 2017; Park et al., 2017; Ge et al., 2017). These results show that all second-order critical points of the BM factorization are global optima if certain qualification conditions are met.

SGD for non-convex neural networks. It has been shown that for over-parameterized two-layer linear networks, all local minima are global minima (Kawaguchi, 2016). Accordingly, a line of work has attempted to show that gradient descent or its modifications provably find local minima and escape saddle points (Ge et al., 2015; Lee et al., 2016; Jin et al., 2017; Daneshmand et al., 2018). However, these works assume Lipschitz gradients and Hessians of the non-convex objective, which is not typically satisfied.
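The low-rank matrix factorization results above can be illustrated with a minimal sketch (an illustrative toy, not any cited paper's exact setup): when the factorization rank is at least the rank of the target matrix, plain gradient descent on the factored objective f(U, V) = ||UV^T - M||_F^2 typically drives the loss to zero, i.e., it reaches a global optimum despite the non-convexity. Dimensions, step size, and iteration count below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 20, 15, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-3 target

# Small random initialization of the Burer-Monteiro factors.
U = 0.1 * rng.standard_normal((n, r))
V = 0.1 * rng.standard_normal((m, r))
lr = 1e-2
for _ in range(5000):
    R = U @ V.T - M           # residual
    gU, gV = R @ V, R.T @ U   # gradients of ||U V^T - M||_F^2 / 2
    U, V = U - lr * gU, V - lr * gV

loss = np.linalg.norm(U @ V.T - M) ** 2
print(f"final loss: {loss:.2e}")  # near machine precision in this toy setting
```

For practical neural networks with non-linear activations, however, no such guarantee was previously available, which motivates the optimality conditions developed in this work.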
Another line of work shows that gradient descent converges to global optima for sufficiently over-parameterized neural networks, with either the parameter count being a high-order polynomial in the sample count (Du et al., 2018; 2019; Arora et al., 2019), or the network architecture being simple (Du & Lee, 2018). In practice, it has been empirically observed that SGD

