SCALING CONVEX NEURAL NETWORKS WITH BURER-MONTEIRO FACTORIZATION

Abstract

Recently, it has been demonstrated that the training problem for a wide variety of (non-)linear two-layer neural networks (such as two-layer perceptrons, convolutional networks, and self-attention) can be posed as an equivalent convex optimization problem, with an induced regularizer that encourages low rank. However, this regularizer becomes prohibitively expensive to compute at moderate scales, impeding the training of convex neural networks. To this end, we propose applying the Burer-Monteiro factorization to convex neural networks, which for the first time enables a Burer-Monteiro perspective on neural networks with non-linearities. This factorization leads to an equivalent yet computationally tractable non-convex alternative with no spurious local minima. We develop a novel relative optimality bound for stationary points of the Burer-Monteiro factorization, thereby providing verifiable conditions under which any stationary point is a global optimum. Further, we show for the first time that linear self-attention with sufficiently many heads has no spurious local minima. Our experiments demonstrate the implications of the relative optimality bound for stationary points of the Burer-Monteiro factorization.

1. INTRODUCTION

It has been demonstrated that the training problem for (non-linear) two-layer neural networks is equivalent to a convex program (Pilanci & Ergen, 2020; Ergen & Pilanci, 2020; Sahiner et al., 2021b; Ergen et al., 2021; Sahiner et al., 2021a). This has been observed for a variety of architectures, including multi-layer perceptrons (MLPs) (Pilanci & Ergen, 2020; Sahiner et al., 2021b), convolutional neural networks (CNNs) (Ergen & Pilanci, 2020; Sahiner et al., 2021c), and self-attention-based transformers (Sahiner et al., 2022). A major benefit of convex training of neural networks is that global optimality is guaranteed, which brings transparency to the training process. The convex formulation induces biases through regularization of the network weights. For linear activation, the convex model directly imposes nuclear-norm regularization, which is well-known to encourage low-rank solutions (Recht et al., 2010). For ReLU activation, however, the convex model induces a type of nuclear norm which promotes sparse factorizations in which the left factor is constrained to an affine space (Sahiner et al., 2021b). This constrained nuclear norm is NP-hard to compute, which impedes the utility of convex neural networks with ReLU activation. To address this computational challenge, we seek a method which (i) inherits the per-iteration complexity of non-convex neural network training, and (ii) inherits the optimality guarantees and transparency of convex training. To this end, we leverage the well-studied Burer-Monteiro (BM) factorization (Burer & Monteiro, 2003), which was originally proposed as a heuristic to reduce the complexity of solving convex semi-definite programs (SDPs).
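As a concrete illustration of the Burer-Monteiro idea in the linear-activation case (a numerical sketch for intuition, not the method developed in this paper), the nuclear norm admits the variational form ||W||_* = min_{UV^T = W} (1/2)(||U||_F^2 + ||V||_F^2), and the minimum is attained by the "balanced" factors built from the SVD of W. The snippet below checks this identity numerically; all variable names are illustrative.

```python
import numpy as np

# The Burer-Monteiro view of the nuclear norm:
#   ||W||_* = min over factorizations W = U V^T of (1/2)(||U||_F^2 + ||V||_F^2),
# attained by the balanced factors U = A sqrt(S), V = B sqrt(S) from W = A S B^T.

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))  # rank-4 matrix

# Nuclear norm: sum of singular values.
nuclear_norm = np.linalg.svd(W, compute_uv=False).sum()

# Balanced Burer-Monteiro factors from the SVD.
A, s, Bt = np.linalg.svd(W, full_matrices=False)
U = A * np.sqrt(s)      # scale each column of A by sqrt(sigma_i)
V = Bt.T * np.sqrt(s)   # scale each column of B by sqrt(sigma_i)
assert np.allclose(U @ V.T, W)  # U V^T exactly reconstructs W

# The BM objective at the balanced factors equals the nuclear norm,
# since the columns of A and B are orthonormal.
bm_objective = 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)
print(np.isclose(bm_objective, nuclear_norm))  # True
```

This identity is why minimizing a squared-Frobenius penalty over the factors (the non-convex BM problem) implicitly minimizes the nuclear norm of their product, without ever forming an SVD inside the training loop.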
BM has been applied as an efficient solution strategy for problems ranging from matrix factorization (Zheng & Lafferty, 2016; Park et al., 2017; Ge et al., 2017; Gillis, 2017) to rank minimization (Mardani et al., 2013; Recht et al., 2010; Wang et al., 2017) and matrix completion (Mardani et al., 2015; Ge et al., 2017). BM has also been analyzed for simplified neural network models (Kawaguchi, 2016; Haeffele & Vidal, 2017; Du & Lee, 2018), where optimality conditions for local minima are provided. However, no work has deployed the BM factorization for practical non-linear neural networks, and no guarantees are available on the optimality of their stationary points. This is likely because BM theory does not apply to standard non-convex ReLU networks, owing to the non-linearity between the layer weights.

