SCALING CONVEX NEURAL NETWORKS WITH BURER-MONTEIRO FACTORIZATION

Abstract

Recently, it has been demonstrated that the training problem for a wide variety of (non-)linear two-layer neural networks (such as two-layer perceptrons, convolutional networks, and self-attention) can be posed as an equivalent convex optimization problem, with an induced regularizer which encourages low rank. However, this regularizer becomes prohibitively expensive to compute at moderate scales, impeding the training of convex neural networks. To this end, we propose applying the Burer-Monteiro factorization to convex neural networks, which for the first time enables a Burer-Monteiro perspective on neural networks with non-linearities. This factorization leads to an equivalent yet computationally tractable non-convex alternative with no spurious local minima. We develop a novel relative optimality bound for stationary points of the Burer-Monteiro factorization, thereby providing verifiable conditions under which any stationary point is a global optimum. Further, for the first time, we show that linear self-attention with sufficiently many heads has no spurious local minima. Our experiments demonstrate the implications of the relative optimality bound for stationary points of the Burer-Monteiro factorization.

1. INTRODUCTION

It has been demonstrated that the training problem for (non-linear) two-layer neural networks is equivalent to a convex program (Pilanci & Ergen, 2020; Ergen & Pilanci, 2020; Sahiner et al., 2021b; Ergen et al., 2021; Sahiner et al., 2021a). This has been observed for a variety of architectures, including multi-layer perceptrons (MLPs) (Pilanci & Ergen, 2020; Sahiner et al., 2021b), convolutional neural networks (CNNs) (Ergen & Pilanci, 2020; Sahiner et al., 2021c), and self-attention-based transformers (Sahiner et al., 2022). A major benefit of convex training of neural networks is that global optimality is guaranteed, which brings transparency to training neural networks. The convex formulation of neural networks induces biases through regularization of the network weights. For linear activation, the convex model directly imposes nuclear-norm regularization, which is well-known to encourage low-rank solutions (Recht et al., 2010). For ReLU activation, however, the convex model induces a type of nuclear norm which promotes a sparse factorization whose left factor is constrained to an affine space (Sahiner et al., 2021b). This constrained nuclear norm is NP-hard to compute, which impedes the utility of convex neural networks with ReLU activation. To address this computational challenge, we seek a method which (i) inherits the per-iteration complexity of non-convex neural network training, and (ii) inherits the optimality guarantees and transparency of convex training. To this end, we leverage the well-studied Burer-Monteiro (BM) factorization (Burer & Monteiro, 2003), which was originally proposed as a heuristic to improve the complexity of solving convex semi-definite programs (SDPs).
BM has been applied as an efficient solution strategy for problems ranging from matrix factorization (Zheng & Lafferty, 2016; Park et al., 2017; Ge et al., 2017; Gillis, 2017) to rank minimization (Mardani et al., 2013; Recht et al., 2010; Wang et al., 2017) and matrix completion (Mardani et al., 2015; Ge et al., 2017). BM has also been used for over-simplified neural network models (Kawaguchi, 2016; Haeffele & Vidal, 2017; Du & Lee, 2018), where optimality conditions for local minima are provided. However, no work has deployed the BM factorization for practical non-linear neural networks, and no guarantees are available for the optimality of stationary points. This is likely because BM theory is not applicable to standard non-convex ReLU networks due to the non-linearity between layer weights. Thus, our focus in this work is to adapt BM to practical two-layer (non-linear) convex neural networks. We consider three common architectures, namely MLPs, CNNs, and self-attention networks. For these scenarios, we develop verifiable relative optimality bounds for all local minima and stationary points, which are simple to check and interpret. In light of these conditions, we identify useful insights about the properties of neural networks that contribute to optimality. In particular, we observe that for self-attention networks, all local minima coincide with global optima if there are sufficiently many heads. The optimality guarantees also provide useful algorithmic insights, allowing one to verify whether lightweight first-order methods such as SGD achieve the global optimum of the non-convex training problem. Our experiments on an image classification task indicate that this BM factorization enables layerwise training of convex CNNs, which allows convex networks, for the first time, to match the performance of multi-layer end-to-end trained non-convex CNNs.

1.1. CONTRIBUTIONS

All in all, our contributions are summarized as follows:
• We propose the BM factorization for efficiently solving convex neural networks with ReLU activation at moderate and large scales. This is the first time BM theory has been applied to the non-linear neural network setting.
• We derive a novel bound on the relative optimality of the stationary points of the BM factorization for neural networks.
• Accordingly, we identify simple and verifiable conditions which guarantee that a stationary point of the non-convex BM formulation achieves the global optimum of the convex neural network.
• We provide basic insights into the fundamental properties of neural networks that contribute to optimality, e.g. that linear self-attention has no spurious local minima if it has sufficiently many heads.
• Our experiments verify the proposed relative optimality bound for stationary points of the BM factorization, and uncover cases where SGD finds saddle points, even in two-layer neural networks.

1.2. RELATED WORK

Burer-Monteiro factorization. The Burer-Monteiro (BM) factorization was first introduced in (Burer & Monteiro, 2003; 2005). There has been a long line of work studying the use of this factorization for solving SDPs (Boumal et al., 2016; Cifuentes & Moitra, 2019; Waldspurger & Waters, 2020; Erdogdu et al., 2021). In the rectangular-matrix case, gradient descent converges to a global optimum of the matrix factorization problem with high probability for certain classes of matrices (Zheng & Lafferty, 2016). The BM factorization has also been studied in the rectangular case in more generic settings (Bach et al., 2008; Haeffele et al., 2014; Haeffele & Vidal, 2017).

Nuclear norm and rank minimization. The ability of nuclear-norm regularization to induce low rank has been studied extensively in compressed sensing (Candès & Recht, 2009; Recht et al., 2010; Candès & Tao, 2010). The BM factorization has been applied to scale up nuclear-norm minimization (Mardani et al., 2015; 2013). It has also been deployed for low-rank matrix factorization (Cabral et al., 2013; Zhu et al., 2017; Park et al., 2017; Ge et al., 2017). The results show that all second-order critical points of the BM factorization are global optima if certain qualification conditions are met.

SGD for non-convex neural networks. It has been shown that for over-parameterized two-layer linear networks, all local minima are global minima (Kawaguchi, 2016). Accordingly, a line of work has attempted to show that gradient descent or its modifications provably find local minima and escape saddle points (Ge et al., 2015; Lee et al., 2016; Jin et al., 2017; Daneshmand et al., 2018). However, these works assume Lipschitz gradients and Hessians of the non-convex objective, assumptions which are not typically satisfied.
Another line of work shows that gradient descent converges to global optima for sufficiently over-parameterized neural networks, with either the parameter count being a high-order polynomial of the sample count (Du et al., 2018; 2019; Arora et al., 2019), or the network architecture being simple (Du & Lee, 2018). In practice, it has been empirically observed that SGD can converge to local maxima, or get stuck in saddle points (Du et al., 2017; Ziyin et al., 2021). For unregularized matrix factorization, it has also recently been shown that randomly initialized gradient descent on the BM factorization provably converges to global minima (Ye & Du, 2021).

Convex neural networks. ReLU neural networks have equivalent convex training programs, including networks with scalar outputs (Pilanci & Ergen, 2020), vector outputs (Sahiner et al., 2021b), convolutional networks (Ergen & Pilanci, 2020; Sahiner et al., 2021c), polynomial-activation networks (Bartan & Pilanci, 2021), batch-norm-based networks (Ergen et al., 2021), Wasserstein GANs (Sahiner et al., 2021a), and self-attention networks (Sahiner et al., 2022). Despite efforts in developing efficient solvers, convex networks are only effectively trainable at small scales (Bai et al., 2022; Mishkin et al., 2022). Our novelty is to adapt the BM factorization as a fast and scalable solution for training convex networks, with simple, verifiable conditions for global optimality.

2. PRELIMINARIES

We denote (·)_+ := max{0, ·} as the ReLU non-linearity. We use superscripts, say A^(i_1,i_2), to denote blocks of matrices, and brackets, say A[i_1, i_2], to denote elements of matrices. We let 1 be the vector of ones of appropriate size, and B_H be the unit ∥·∥_H-ball {u : ∥u∥_H ≤ 1}. Unless otherwise stated, let F be a convex, differentiable function. We use n to denote the number of samples and c to denote the output dimension of each network. All proofs are presented in the Appendix.

2.1. TWO-LAYER NEURAL NETWORKS AS CONVEX PROGRAMS

A line of work has demonstrated that two-layer neural networks are equivalent to convex optimization problems. Consider a data matrix X ∈ R^{n×d} and a two-layer σ-activation network with c outputs, m neurons, and weight-decay parameter β > 0:

p*_MLP := min_{W_1 ∈ R^{d×m}, W_2 ∈ R^{c×m}} F(σ(XW_1)W_2^⊤) + (β/2) Σ_{j=1}^m (∥w_{1j}∥_2^2 + ∥w_{2j}∥_2^2).  (1)

When σ is a linear activation and m ≥ m* for some m* ≤ min{d, c}, this problem is equivalent to ((Rennie & Srebro, 2005), Section 2.2)

p*_LMLP = min_{Z ∈ R^{d×c}} F(XZ) + β∥Z∥_*,  (2)

whereas for a ReLU activation and m ≥ m* for some unknown, problem-dependent m* ≤ nc ((Sahiner et al., 2021b), Thm. 3.1),

p*_RMLP = min_{Z_j ∈ R^{d×c}} F(Σ_{j=1}^P D_j X Z_j) + β Σ_{j=1}^P ∥Z_j∥_{*,K_j},  K_j := (2D_j − I_n)X,  (3)

where {D_j}_{j=1}^P = {diag(1{Xu ≥ 0}) : u ∈ R^d} enumerates the possible activation patterns generated from X, and the number of such patterns satisfies P ≤ 2r(e(n−1)/r)^r, where r := rank(X) (Stanley et al., 2004; Pilanci & Ergen, 2020). The expression (3) also involves a constrained nuclear norm, defined as

∥Z∥_{*,K} := min_{t ≥ 0} t  s.t.  Z ∈ tC,  C := conv{Z = uv^⊤ : Ku ≥ 0, ∥u∥_2 ≤ 1, ∥v∥_2 ≤ 1}.  (4)

This quasi-nuclear norm differs from the standard nuclear norm in that the factorization upon which it relies imposes a constraint on its left factors. In convex ReLU neural networks, this norm enforces the existence of {u_k, v_k} such that Z = Σ_k u_k v_k^⊤ and D_j X Z = Σ_k (X u_k)_+ v_k^⊤, and penalizes Σ_k ∥u_k v_k^⊤∥_*. This norm is NP-hard to compute. A variant of the ReLU activation, called the gated ReLU activation, achieves the piecewise linearity of ReLU without enforcing the constraints (Fiat et al., 2019). Specifically, the ReLU gates are fixed to some {h_j}_{j=1}^P to form

σ(X w_{1j}) := diag(1{X h_j ≥ 0})(X w_{1j}) = D_j X w_{1j}.  (5)

With the gated ReLU activation, the equivalent convex program is given by ((Mishkin et al., 2022), Thm. 2.2; (Sahiner et al., 2022), eq. (8))

p*_GMLP = min_{Z_j ∈ R^{d×c}} F(Σ_{j=1}^P D_j X Z_j) + β Σ_{j=1}^P ∥Z_j∥_*,  (6)

which converts the constrained nuclear norm penalty to a standard nuclear norm penalty, thereby improving the complexity of the ReLU network.

In addition to the multi-layer perceptron (MLP) formulation, two-layer ReLU-activation convolutional neural networks (CNNs) with global average pooling have been demonstrated to be equivalent to convex programs as well (Sahiner et al., 2021b; c; Ergen & Pilanci, 2020). The non-convex formulation is given by

p*_RCNN := min_{w_{1j} ∈ R^h, w_{2j} ∈ R^c} Σ_{i=1}^n F(Σ_{j=1}^m w_{2j} 1^⊤ (X_i w_{1j})_+) + (β/2) Σ_{j=1}^m (∥w_{1j}∥_2^2 + ∥w_{2j}∥_2^2),  (7)

where samples X_i ∈ R^{K×h} are represented by patch matrices, which hold a convolutional patch of size h in each of their K rows. In particular, each row of X_i contains the data with which a convolutional kernel takes an inner product; h is the product of the kernel dimensions, and K is the number of patches each kernel passes over. It has been shown ((Sahiner et al., 2021b), Cor. 5.1) that as long as m ≥ m*, where m* ≤ nc, this is equivalent to the convex program

p*_RCNN = min_{Z_j ∈ R^{h×c}} Σ_{i=1}^n F((Σ_{j=1}^P 1^⊤ D_j^(i) X_i Z_j)^⊤) + β Σ_{j=1}^P ∥Z_j∥_{*,K_j},  K_j := (2D_j − I_{nK})X,  X := [X_1^⊤ ··· X_n^⊤]^⊤,  (8)

where {D_j}_{j=1}^P = {diag(1{Xu ≥ 0}) : u ∈ R^h} and D_j^(i) ∈ R^{K×K} is the block of D_j corresponding to sample i. Noting that P ≤ 2r(e(n−1)/r)^r with r ≤ h, the only exponential dependence of P is on h, which is typically fixed.

Lastly, we review existing convexity results for self-attention transformers (Sahiner et al., 2022). A single block of multi-head self-attention with m heads, where X_i ∈ R^{s×d} has s tokens and d features, has the non-convex objective

p*_SA := min_{W_{1j} ∈ R^{d×d}, W_{2j} ∈ R^{d×c}} Σ_{i=1}^n F(Σ_{j=1}^m σ(X_i W_{1j} X_i^⊤) X_i W_{2j}) + (β/2) Σ_{j=1}^m (∥W_{1j}∥_F^2 + ∥W_{2j}∥_F^2),  (9)

for which a variety of objectives F can be posed, including classification (e.g.
F incorporates global average pooling followed by softmax-cross-entropy with labels) or denoising (e.g. F is a squared loss against a label matrix). In the linear activation case, as long as m ≥ m*, where m* ≤ min{d^2, dc}, this is equivalent to ((Sahiner et al., 2022), Thm. 3.1)

p*_LSA = min_{Z ∈ R^{d^2×dc}} Σ_{i=1}^n F(Σ_{k=1}^d Σ_{ℓ=1}^d G_i[k,ℓ] X_i Z^(k,ℓ)) + β∥Z∥_*,  (10)

where G_i := X_i^⊤ X_i, G_i[k,ℓ] ∈ R, and {Z^(k,ℓ) ∈ R^{d×c}} are the blocks of Z. A similar formulation can be posed for ReLU and gated ReLU activations. In this work, we show that these network architectures are amenable to the BM factorization.
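The activation patterns {D_j} above are central to the convex ReLU and gated ReLU programs. Below is a minimal sketch of how such patterns can be collected; exact enumeration requires a hyperplane-arrangement algorithm (polynomial-time for fixed rank(X)), and the random sampling of gate vectors used here is purely an illustrative assumption, not the paper's procedure.

```python
import numpy as np

def sample_activation_patterns(X, n_draws=1000, seed=0):
    """Collect distinct D_j = diag(1{X u >= 0}) by sampling gate vectors u.

    Random sampling recovers the sign patterns with high probability; it is
    used here only for illustration.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    patterns = set()
    for _ in range(n_draws):
        u = rng.standard_normal(d)
        patterns.add(tuple((X @ u >= 0).astype(int).tolist()))
    return [np.diag(np.array(p, dtype=float)) for p in sorted(patterns)]

# n = 4 points in d = 2 dimensions: the 4 hyperplanes through the origin
# partition R^2 into at most 2 * 4 = 8 regions, hence at most 8 patterns.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0], [1.0, 1.0]])
Ds = sample_activation_patterns(X)
```

Note that the number of distinct patterns found respects the bound P ≤ 2r(e(n−1)/r)^r with r = rank(X).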

2.2. THE BURER-MONTEIRO FACTORIZATION

First proposed by Burer & Monteiro (2003), the Burer-Monteiro (BM) factorization solves SDPs over a square matrix Q in terms of rectangular factors R, where Q is substituted by RR^⊤. It was first demonstrated that solving over R does not introduce spurious local minima, provided rank(R) ≥ rank(Q*) for an optimal solution Q* of the original SDP (Burer & Monteiro, 2005). In general, we seek applications where we optimize over a non-square matrix Z, i.e.

p*_CVX := min_{Z ∈ R^{d×c}} F(Z)  (11)

for a convex, differentiable function F. One may approach this by factoring Z = UV^⊤, where U ∈ R^{d×m}, V ∈ R^{c×m} for some arbitrary choice of m. Then, we have an equivalent non-convex problem over R := [U; V] (the vertical concatenation of U and V), with f(R) := F(UV^⊤):

p*_CVX = min_R f(R).  (12)

Noting that (11) is convex over RR^⊤ = [UU^⊤, UV^⊤; VU^⊤, VV^⊤], one may directly apply the result of Burer & Monteiro (2005) to conclude that as long as m ≥ rank(Z*), all local minima of (12) are global minima of (11) (see Appendix A.1). A major issue with these results is that rank(Z*) is not known a priori. Naively, one may simply choose m ≥ min{d, c} and be assured that m ≥ rank(Z*), but this approach is not satisfactory if further under-parameterization is desired. To address this issue, work by Bach et al. (2008) and Haeffele et al. (2014) demonstrates that all rank-deficient local minimizers of (12) achieve the global minimum p*_CVX (under mild conditions, see Appendix A.2). A long line of work has analyzed the conditions under which known non-convex optimization algorithms converge to second-order critical points (local minima) (Ge et al., 2015; Jin et al., 2017; Daneshmand et al., 2018). Under the assumption of a bounded f and a bounded Hessian, a second-order critical point can be found by noisy gradient descent (Ge et al., 2015), or other second-order algorithms (Sun et al., 2015).
Even vanilla gradient descent with random initialization has been demonstrated to almost surely converge to a local minimum for f with Lipschitz gradient (Lee et al., 2016). However, if the gradient of f is not Lipschitz continuous, there are no guarantees that gradient descent will find a second-order critical point of (12): one may encounter a stationary point which is a saddle. For example, in the linear regression setting, i.e. f(R) = ∥XUV^⊤ − Y∥_F^2, the gradient of f is Lipschitz continuous with respect to U when V is fixed and vice-versa, but not Lipschitz continuous with respect to R (Mukkamala & Ochs, 2019). Thus, one may not directly apply the results of Ge et al. (2015); Sun et al. (2015); Lee et al. (2016) in this case. Instead, we seek to understand the conditions under which stationary points of (12) correspond to global optima of (11). One such condition is given in Mardani et al. (2013; 2015).

Theorem 2.1 (From (Mardani et al., 2013)). Stationary points (Û, V̂) of the optimization problem

p* := min_{U,V} (1/2)∥UV^⊤ − Y∥_F^2 + (β/2)(∥U∥_F^2 + ∥V∥_F^2)  (14)

correspond to global optima Z* = ÛV̂^⊤ of the equivalent convex optimization problem

p* = min_Z (1/2)∥Z − Y∥_F^2 + β∥Z∥_*  (15)

provided that ∥Y − ÛV̂^⊤∥_2 ≤ β.
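The certificate of Theorem 2.1 is easy to check numerically. The sketch below (problem sizes, step size, and iteration count are illustrative assumptions) runs plain gradient descent on the BM objective (14) and verifies that the returned stationary point satisfies ∥Y − ÛV̂^⊤∥_2 ≤ β; for this objective the global optimum of (15) is singular-value soft-thresholding of Y, so the certificate can be cross-checked against the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, m, beta = 6, 5, 5, 2.0

# Build a target Y with known singular values (5.0 and 0.5), so the
# nuclear-norm solution of (15) soft-thresholds them to (3.0, 0.0).
Qu, _ = np.linalg.qr(rng.standard_normal((d, d)))
Qv, _ = np.linalg.qr(rng.standard_normal((c, c)))
Y = Qu[:, :2] @ np.diag([5.0, 0.5]) @ Qv[:, :2].T

# Gradient descent on the BM objective (14):
#   (1/2)||U V^T - Y||_F^2 + (beta/2)(||U||_F^2 + ||V||_F^2)
U = 0.1 * rng.standard_normal((d, m))
V = 0.1 * rng.standard_normal((c, m))
lr = 0.01
for _ in range(20000):
    E = U @ V.T - Y
    gU = E @ V + beta * U
    gV = E.T @ U + beta * V
    U -= lr * gU
    V -= lr * gV

Z_hat = U @ V.T
# Theorem 2.1 certificate: the stationary point is a global optimum of the
# convex problem (15) whenever the residual spectral norm is at most beta.
residual_norm = np.linalg.norm(Y - Z_hat, 2)
```

At the soft-thresholded optimum the residual has singular values min(σ_i, β), so the certificate holds with equality in the top direction.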

3.1. MLPS

We first seek to compare the convex formulations of the MLP training problem (2), (3), and (6) to their BM factorizations. We describe how to find the BM factorization for any convex MLP.

Lemma 3.1. For any matrix M ∈ R^{n×d_c}, let f(U, V) := F(MUV^⊤) be a differentiable function. For any β > 0 and arbitrary vector norms ∥·∥_R and ∥·∥_C, we define the Burer-Monteiro factorization

p* := min_{U ∈ R^{d_c×m}, V ∈ R^{d_r×m}} f(U, V) + (β/2) Σ_{j=1}^m (∥u_j∥_C^2 + ∥v_j∥_R^2).  (16)

For the matrix norm ∥·∥_D defined as

∥Z∥_D := max_R trace(R^⊤Z)  s.t.  u^⊤Rv ≤ 1 ∀u ∈ B_C, ∀v ∈ B_R,  (17)

the problem (16) is equivalent to the convex optimization problem

p* = min_{Z ∈ R^{d_c×d_r}} F(MZ) + β∥Z∥_D.  (18)

Remark 3.2. In the case of a linear MLP, M = X, d_c = d, d_r = c, and ∥·∥_D = ∥·∥_*, so using the definition of ∥·∥_D, the corresponding BM factorization has ∥·∥_R = ∥·∥_C = ∥·∥_2 (Bach et al., 2008). For a gated ReLU network, the regularizer is still the nuclear norm, and thus the same ∥·∥_R = ∥·∥_C = ∥·∥_2 regularization appears in the BM factorization. In the case of the ReLU MLP, the nuclear norm is replaced by ∥·∥_D = Σ_{j=1}^P ∥(·)_j∥_{*,K_j}, which in the BM factorization amounts to adding the constraint K_j U_j ≥ 0.

We accordingly express the BM factorizations of convex MLPs below:

p*_LMLP = min_{U ∈ R^{d×m}, V ∈ R^{c×m}} F(XUV^⊤) + (β/2)(∥U∥_F^2 + ∥V∥_F^2)  (19)

p*_GMLP = min_{U_j ∈ R^{d×m}, V_j ∈ R^{c×m}} F(Σ_{j=1}^P D_j X U_j V_j^⊤) + (β/2) Σ_{j=1}^P (∥U_j∥_F^2 + ∥V_j∥_F^2)  (20)

p*_RMLP = min_{U_j ∈ R^{d×m} : (2D_j − I_n)XU_j ≥ 0, V_j ∈ R^{c×m}} F(Σ_{j=1}^P D_j X U_j V_j^⊤) + (β/2) Σ_{j=1}^P (∥U_j∥_F^2 + ∥V_j∥_F^2)  (21)

To the best of our knowledge, (21) presents the first application of the BM factorization to a non-linear neural network, which is enabled by the convex model (3). In the linear case, the BM factorization (19) is identical to the original non-convex formulation of a linear MLP with m neurons. Furthermore, in the case of gated ReLU, the BM factorization with m = 1 is equivalent to the original non-convex formulation.
However, for two-layer ReLU networks, the BM factorization even with m = 1 corresponds to a different (i.e. constrained, rather than ReLU-activation) model than the non-convex formulation. While the original convex program is NP-hard, the cost function of the BM factorization is very simple to evaluate. Thus, the per-iteration complexity of the BM factorization is much lower than that of the convex ReLU MLP. The BM factorizations of these convex MLPs are non-convex, hence finding a global minimum appears intractable. However, the following theorem demonstrates that any rank-deficient local minimum of the BM factorization corresponds to a global optimum.

Theorem 3.3. If m ≥ rank(Z*), where Z* is a minimizer of (18), all local minima of the BM factorization (16) are global minima. Furthermore, if F is twice-differentiable, any rank-deficient local minimum R̂ = [Û; V̂] of (16) corresponds to a global minimizer Z* = ÛV̂^⊤ of (18).

This result demonstrates that these two-layer convex MLPs have no spurious local minima under mild conditions. However, there remains an algorithmic challenge: it is not straightforward to obtain a guaranteed local minimum when the gradients of f are not Lipschitz continuous. The following result provides a general condition under which stationary points of (16) are global optima of (18).

Theorem 3.4. For any non-negative objective function F and a stationary point (Û, V̂) of (16) with corresponding Ẑ = ÛV̂^⊤ and objective value p̂ for (18), the relative optimality gap (p̂ − p*)/p* satisfies

(p̂ − p*)/p* ≤ (∥∇_Z F(MẐ)∥_D* / β − 1)_+,  (22)

where ∥·∥_D* is the dual norm of ∥·∥_D.

In particular, this bound can be calculated by taking the gradient of the unregularized objective function, evaluated at the candidate solution Ẑ to the convex problem (18), which is formed from the stationary point of the BM problem (16).
Intuitively, if the ∥·∥_D* norm of this gradient is at most β, then by the subgradient condition, Ẑ is an optimal solution of (18). In the case of a linear MLP with X = I_d, F a squared-loss objective, and ∥∇_Z F(MẐ)∥_D* ≤ β, our result exactly recovers Theorem 2.1 of Mardani et al. (2013). Furthermore, when this condition is not exactly satisfied, (22) provides a novel result in the form of an optimality gap bound. To our knowledge, this is the first result that generalizes the optimality conditions for stationary points of any BM factorization of a neural network. It provides an easily computable bound after solving (16) which quantifies how close a solution is to the global minimum. In the case of a ReLU MLP, the relative optimality gap is given by

(p̂ − p*)/p* ≤ (max_{j ∈ [P], u ∈ B_2 : K_j u ≥ 0} (1/β)∥∇_{Z_j} F(Σ_{j'=1}^P D_{j'} X Ẑ_{j'})^⊤ u∥_2 − 1)_+.  (23)

Computing this quantity amounts to solving a cone-constrained PCA problem (Deshpande et al., 2014), which can be done in polynomial time when d is constant. We note that trivial stationary points are present in any problem, such as (Û, V̂) = (0, 0), so we cannot conclude that all stationary points are global optima. However, in certain cases, the optimality gap (22) of stationary points is always zero, as we show next.

Theorem 3.5. A stationary point (Û, V̂) of (16) is a global minimizer of (18) if ∥·∥_R = ∥·∥_C = ∥·∥_2 and rank(Û) = rank(V̂) = min{d_c, d_r}.

Thus, for linear and gated ReLU MLPs, we can ensure that if the Burer-Monteiro factorization achieves a stationary point with full rank, it corresponds to the global optimum of the convex program. We now extend these results to CNN and self-attention architectures.
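To make the certificate of Theorem 3.4 concrete, the following sketch trains the gated ReLU BM factorization (20) by gradient descent on synthetic data, solves the convex program (6) by proximal gradient descent as a reference, and checks the relative optimality gap bound (22). All problem sizes, random gates, step sizes, and iteration counts are illustrative assumptions; for the gated ReLU model, ∥·∥_D* is the maximum blockwise spectral norm of the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c, m, P, beta = 20, 3, 2, 4, 4, 0.5
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, c))

# Fixed gates (gated ReLU): D_j = diag(1{X h_j >= 0}), A_j = D_j X.
A = [np.diag((X @ rng.standard_normal(d) >= 0).astype(float)) @ X for _ in range(P)]

# --- BM factorization (20), trained by plain gradient descent ---
Us = [0.1 * rng.standard_normal((d, m)) for _ in range(P)]
Vs = [0.1 * rng.standard_normal((c, m)) for _ in range(P)]
lr = 0.003
for _ in range(40000):
    E = sum(A[j] @ Us[j] @ Vs[j].T for j in range(P)) - Y  # grad of F = 0.5||.-Y||_F^2
    for j in range(P):
        gU = A[j].T @ E @ Vs[j] + beta * Us[j]
        gV = E.T @ A[j] @ Us[j] + beta * Vs[j]
        Us[j] -= lr * gU
        Vs[j] -= lr * gV

Zh = [Us[j] @ Vs[j].T for j in range(P)]
E = sum(A[j] @ Zh[j] for j in range(P)) - Y
p_hat = 0.5 * np.linalg.norm(E) ** 2 + beta * sum(
    np.linalg.svd(Z, compute_uv=False).sum() for Z in Zh)

# Theorem 3.4 certificate: maximum blockwise spectral norm of the gradient.
eps = max(0.0, max(np.linalg.norm(A[j].T @ E, 2) for j in range(P)) / beta - 1.0)

# --- Reference: solve the convex program (6) by proximal gradient (SVT prox) ---
def svt(Z, tau):
    Uz, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return Uz @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

Zs = [np.zeros((d, c)) for _ in range(P)]
step = 1.0 / np.linalg.norm(np.hstack(A), 2) ** 2  # 1/L for the smooth part
for _ in range(20000):
    E2 = sum(A[j] @ Zs[j] for j in range(P)) - Y
    Zs = [svt(Zs[j] - step * A[j].T @ E2, beta * step) for j in range(P)]
p_star = 0.5 * np.linalg.norm(sum(A[j] @ Zs[j] for j in range(P)) - Y) ** 2 \
         + beta * sum(np.linalg.svd(Z, compute_uv=False).sum() for Z in Zs)

gap = (p_hat - p_star) / p_star
```

At a (near-)stationary point of the BM problem, `gap` should not exceed `eps` up to numerical tolerance, which is exactly the content of (22).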

3.2. CNNS

Before exploring the BM factorization in the context of two-layer CNNs, we first provide a new result on an equivalent convex program for two-layer ReLU CNNs with arbitrary linear pooling operations, which extends the results of Sahiner et al. (2021b); Ergen & Pilanci (2020) on global-average-pooling CNNs. Define P_a ∈ R^{a×K} to be a linear pooling matrix which pools the K spatial dimensions to an arbitrary size a. Then, we express the non-convex two-layer CNN problem as

p*_CNN := min_{w_{1j} ∈ R^h, W_{2j} ∈ R^{c×a}} Σ_{i=1}^n F(Σ_{j=1}^m W_{2j} P_a σ(X_i w_{1j})) + (β/2) Σ_{j=1}^m (∥w_{1j}∥_2^2 + ∥W_{2j}∥_F^2).  (25)

Theorem 3.6. For β > 0 and ReLU activation σ(·) = (·)_+, if m ≥ m*, where m* ≤ nac, then (25) is equivalent to a convex optimization problem, given by

p*_CNN = min_{Z_k ∈ R^{h×ac}} Σ_{i=1}^n F(Σ_{k=1}^P (trace(P_a D_k^(i) X_i Z_k^(1)), ..., trace(P_a D_k^(i) X_i Z_k^(c)))^⊤) + β Σ_{k=1}^P ∥Z_k∥_{*,K_k},  (26)

K_k := (2D_k − I_{nK}) [X_1^⊤ ··· X_n^⊤]^⊤,  Z_k^(c') ∈ R^{h×a} ∀c' ∈ [c].

Thus, we provide a novel result which characterizes two-layer CNNs with arbitrary linear pooling operations as a convex program. Similar results can be shown for the linear and gated ReLU activation cases. With this established, we present our main results on the BM factorization for CNNs.

Lemma 3.7. The BM factorization of the convex CNN problem with ReLU activation is given as follows:

p*_RCNN = min_{{u_{jk} ∈ R^h}, {V_{jk} ∈ R^{c×a}} : (2D_k^(i) − I)X_i u_{jk} ≥ 0} Σ_{i=1}^n F(Σ_{k=1}^P Σ_{j=1}^m V_{jk} P_a D_k^(i) X_i u_{jk}) + (β/2) Σ_{k=1}^P Σ_{j=1}^m (∥u_{jk}∥_2^2 + ∥V_{jk}∥_F^2).  (27)

This BM factorization closely resembles the original non-convex formulation (25). In general, (27) inherits the results of Theorems 3.3, 3.4, and 3.5; we present one such corollary here.

Corollary 3.7.1. A stationary point ((û_{jk}, V̂_{jk})_{j=1}^m)_{k=1}^P of (27) corresponds to a global minimizer Ẑ_k = Σ_{j=1}^m û_{jk} vec(V̂_{jk})^⊤ of (26) provided that

∥Σ_{i=1}^n ∇_{Z_k} F(Σ_{k'=1}^P (trace(P_a D_{k'}^(i) X_i Z_{k'}^(1)), ..., trace(P_a D_{k'}^(i) X_i Z_{k'}^(c)))^⊤)^⊤ u∥_2 ≤ β,  ∀k ∈ [P], ∀u ∈ B_2 : (2D_k^(i) − I)X_i u ≥ 0.  (28)
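The patch-matrix representation X_i used throughout this section can be made concrete for a 1-D convolution. The sketch below (stride 1, no padding, and the global-average-pooling choice P_a = (1/K)1^⊤ are assumptions for illustration) checks that each row of X_i takes an inner product with the kernel, so X_i w reproduces a direct valid cross-correlation.

```python
import numpy as np

def patch_matrix(x, h):
    """Patch matrix X_i for a 1-D signal: row k is the length-h window the
    kernel sees at position k (stride 1, no padding)."""
    K = len(x) - h + 1
    return np.stack([x[k:k + h] for k in range(K)])

rng = np.random.default_rng(0)
x = rng.standard_normal(10)     # one input signal
w = rng.standard_normal(3)      # one convolutional kernel, h = 3
Xi = patch_matrix(x, 3)         # K x h with K = 8

# X_i w is exactly the valid cross-correlation of x with w.
conv_out = Xi @ w

# Global average pooling is the linear pooling P_a with a = 1.
Pa = np.ones((1, Xi.shape[0])) / Xi.shape[0]
pooled = Pa @ np.maximum(conv_out, 0.0)    # P_a (X_i w)_+
```

For 2-D convolutions the same construction applies with h equal to the product of the kernel dimensions and each row holding one flattened patch.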

3.3. MULTI-HEAD SELF-ATTENTION

We now, for the first time, extend BM factorization theory to self-attention networks.

Lemma 3.8. The BM factorization of the convex self-attention problem with linear activation is given as follows:

p*_LSA = min_{U_j ∈ R^{d×d}, V_j ∈ R^{d×c}} Σ_{i=1}^n F(Σ_{j=1}^m X_i U_j X_i^⊤ X_i V_j) + (β/2) Σ_{j=1}^m (∥U_j∥_F^2 + ∥V_j∥_F^2).  (29)

In addition to inheriting the results of Theorems 3.3, 3.4, and 3.5, this formulation satisfies the following: if a local minimum is such that

R̂ := [vec(Û_1) ··· vec(Û_m); vec(V̂_1) ··· vec(V̂_m)] ∈ R^{d(d+c)×m}  (30)

(stacking the vectorized first- and second-layer heads) is rank-deficient, then this local minimum is also a global minimum of (10).

In this section, we illustrate the utility of our proposed relative optimality bound for stationary points in the setting of two-layer fully-connected networks. We also seek to examine how this bound changes with respect to the number of samples n, the regularization parameter β (which controls the sparsity of the convex solution), and the number of factors m in the BM factorization. We initialize a class-balanced three-class spiral dataset with a varying number of samples n (see Figure 1 for examples). For this dataset, we then train the gated ReLU MLP BM factorization (20) with a varying number of factors m. We then compare the stationary points of these BM factorizations found by gradient descent (GD) to the global optimum, which we compute from (6).
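The spiral construction just described can be sketched as follows; the exact radii, winding, and noise level used in the paper's Figure 1 are not specified, so the parameterization below is an assumption for illustration.

```python
import numpy as np

def make_spiral(n_per_class=50, n_classes=3, noise=0.2, seed=0):
    """Class-balanced spiral data in R^2 (radii/angles are assumed, not the
    paper's exact construction)."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for k in range(n_classes):
        t = np.linspace(0.05, 1.0, n_per_class)
        theta = 4.0 * np.pi * t + 2.0 * np.pi * k / n_classes \
                + noise * rng.standard_normal(n_per_class)
        xs.append(np.stack([t * np.cos(theta), t * np.sin(theta)], axis=1))
        ys.append(np.full(n_per_class, k))
    return np.concatenate(xs), np.concatenate(ys)

X, y = make_spiral()
Y = np.eye(3)[y]   # one-hot labels for the squared loss used in Section 4
```

The resulting (X, Y) can be plugged directly into the gated ReLU BM factorization (20) with d = 2 and c = 3, matching the setting of the experiments.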

4. EXPERIMENTAL RESULTS: THE RELATIVE OPTIMALITY GAP BOUND

For each stationary point of the BM factorization, we compute the relative optimality gap bound provided by Theorem 3.4. We note that since d = 2 and c = 3 in this case, rank(Z_j*) ≤ 2 for all j, so as long as m ≥ 2, all local minima of the BM factorization are global minima (Burer & Monteiro, 2005; Haeffele et al., 2014). While Lee et al. (2016) demonstrated that gradient descent with random initialization converges to a local minimum almost surely for losses f whose gradient is Lipschitz continuous, we use a squared loss with one-hot-encoded class labels, for which f is not Lipschitz continuous (Mukkamala & Ochs, 2019). Thus, there is no guarantee that GD will find the global minimum. We display results over β for each fixed n in Figure 2. For larger values of β, it becomes much easier for GD to find an optimal solution. We nevertheless find that our bound gives a useful proxy for whether the BM factorization has converged to the global minimum.

Furthermore, noting from Lemma A.6, any optimal solution Z* = ÛV̂^⊤ of (37) corresponds to an optimal solution of (34) by constructing R* = [Û; V̂], X* = R*R*^⊤. Furthermore, m* = rank(Z*) = rank(R*) = rank(X*). Putting everything together, we have that a rank-m* global optimum Z* = ÛV̂^⊤ of (37) corresponds to a rank-m* global optimum X* of (34) (Lemma A.6), which, as long as m ≥ m*, corresponds to a local minimum R̂ = [Û; V̂] (Lemma A.5), which is exactly the local minimum (Û, V̂) of (39).

A.2 RESULT OF (HAEFFELE ET AL., 2014) AND ITS APPLICATION TO RECTANGULAR MATRICES

In this subsection, we outline the precise theoretical statement of (Haeffele et al., 2014) and describe exactly how it corresponds to our summary in Section 2.2, and thus to the later derivations in our work. We first state the following result, without proof, from (Haeffele et al., 2014).

Theorem A.8 (Theorem 2 of (Haeffele et al., 2014)). Let F' : S_n^+ → R be of the form F'(X) = G'(X) + H'(X), where G' is convex and twice differentiable with compact level sets, and H' is a proper convex function such that F' is lower semi-continuous. Then, if R̂ is a rank-deficient local minimizer of

min_{R ∈ R^{(d+c)×m}} F'(RR^⊤),  (41)

then X* = R̂R̂^⊤ is a global minimizer of

min_{X ∈ R^{(d+c)×(d+c)}, X ⪰ 0} F'(X).  (42)

We now describe how this theorem applies to the setting described in Section 2.2, i.e. the rectangular-matrix, non-SDP case.

Lemma A.9 (Used in Section 2.2). Let F : R^{d×c} → R be of the form F(Z) = G(Z) + H(Z), where G is convex and twice differentiable with compact level sets, and H is a proper convex function such that F is lower semi-continuous. Then, if R̂ = [Û; V̂] is a rank-deficient local minimizer of

min_{U ∈ R^{d×m}, V ∈ R^{c×m}} F(UV^⊤),  (43)

then Z* = ÛV̂^⊤ is a global minimizer of

min_{Z ∈ R^{d×c}} F(Z).  (44)

Proof. Define F' : R^{(d+c)×(d+c)} → R such that

F'([X_1, X_2; X_3, X_4]) := F(X_2)  (45)

for X_1 ∈ R^{d×d}, X_2 ∈ R^{d×c}, X_3 ∈ R^{c×d}, X_4 ∈ R^{c×c}. Clearly, if F = G + H, where G is twice differentiable and H is proper convex, then F' = G' + H', where G' is twice differentiable and H' is proper convex. From the proof of Lemma A.7, we know that (43) is exactly the same as (41) for R = [U; V]. Furthermore, we know from Lemma A.6 that any global minimum Z* of (44) corresponds to a global minimum X* of (42). Lastly, we know from Theorem A.8 that any rank-deficient local minimum of (41) corresponds to a global minimizer of (42).
Putting it all together, we have that a rank-deficient local minimizer R̂ = [Û; V̂] of (43) is a rank-deficient local minimizer of (41), which corresponds to a global optimizer X* = R̂R̂^⊤ of (42) (Theorem A.8), which corresponds to a global optimizer Z* = ÛV̂^⊤ of (44) (Lemma A.6).

A.3 PROOF OF LEMMA 3.1

Proof. We first analyze the solution of the following optimization problem:

f* = min_{u_j, v_j} (1/2) Σ_{j=1}^m (∥u_j∥_C^2 + ∥v_j∥_R^2)  s.t.  UV^⊤ = Z.

We can write this as the equivalent problem (Bach et al., 2008; Pilanci & Ergen, 2020)

f* = min_{u_j ∈ B_C, v_j} Σ_{j=1}^m ∥v_j∥_R  s.t.  UV^⊤ = Z.

We can form the Lagrangian of this as

f* = min_{u_j ∈ B_C, v_j} max_R Σ_{j=1}^m ∥v_j∥_R − trace(R^⊤UV^⊤) + trace(R^⊤Z).

By Sion's minimax theorem, we can switch the maximum over R and the minimum over V, and minimize over V to obtain

f* = min_{u ∈ B_C} max_R trace(R^⊤Z)  s.t.  ∥u^⊤R∥_R* ≤ 1.

As long as m ≥ rank(Z), by Slater's condition, we can switch the minimum and maximum (Shapiro, 2009) to obtain

f* = max_R trace(R^⊤Z)  s.t.  ∥u^⊤R∥_R* ≤ 1 ∀u ∈ B_C.

By the definition of the dual norm, we can also write this as

f* = max_R trace(R^⊤Z)  s.t.  u^⊤Rv ≤ 1 ∀u ∈ B_C, ∀v ∈ B_R,

which equals ∥Z∥_D. Thus, with this result, we can write

p* := min_{U ∈ R^{d_c×m}, V ∈ R^{d_r×m}} f(U, V) + (β/2) Σ_{j=1}^m (∥u_j∥_C^2 + ∥v_j∥_R^2),

equivalently as

p* = min_{U ∈ R^{d_c×m}, V ∈ R^{d_r×m}} F(MUV^⊤) + (β/2) Σ_{j=1}^m (∥u_j∥_C^2 + ∥v_j∥_R^2),

or also as

p* = min_{Z : rank(Z) ≤ m} F(MZ) + min_{U, V : UV^⊤ = Z} (β/2) Σ_{j=1}^m (∥u_j∥_C^2 + ∥v_j∥_R^2),

where we now apply our previous result to obtain

p* = min_{Z : rank(Z) ≤ m} F(MZ) + β∥Z∥_D,

which if m ≥ rank(Z*) is equivalent to

p* = min_Z F(MZ) + β∥Z∥_D.

A.4 PROOF OF THEOREM 3.3

Proof. We simply note that (16) is the Burer-Monteiro factorization of (18). Thus, from Lemma A.7, as long as m ≥ rank(Z*), all local minima of (16) are global minima.
Furthermore, note that (18) is composed of two components: a twice-differentiable function and a proper convex function. Thus, by Lemma A.9, all rank-deficient local minima are global minima.

A.5 PROOF OF THEOREM 3.4

Proof. Stationary points of (16) satisfy

0 ∈ ∇_U f(Û, V̂) + β∥Û∥_C ∂∥Û∥_C,
0 ∈ ∇_V f(Û, V̂) + β∥V̂∥_R ∂∥V̂∥_R,

where we define ∥Û∥_C := Σ_{j=1}^m ∥û_j∥_C and analogously for ∥V̂∥_R. By the definition of the subgradient, this stationarity condition can be written as

∃U' s.t. trace(Û^⊤U') = ∥Û∥_C, ∥U'∥_C* ≤ 1, 0 = ∇_U f(Û, V̂) + β∥Û∥_C U',
∃V' s.t. trace(V̂^⊤V') = ∥V̂∥_R, ∥V'∥_R* ≤ 1, 0 = ∇_V f(Û, V̂) + β∥V̂∥_R V'.

By the chain rule, we have that

0 = ∇_Z F(MẐ)V̂ + β∥Û∥_C U',
0 = ∇_Z F(MẐ)^⊤Û + β∥V̂∥_R V'.

We now right-multiply the top equation by Û^⊤ and the bottom equation by V̂^⊤ to obtain

0 = ∇_Z F(MẐ)V̂Û^⊤ + β∥Û∥_C U'Û^⊤,  (57)
0 = ∇_Z F(MẐ)^⊤ÛV̂^⊤ + β∥V̂∥_R V'V̂^⊤.  (58)

Taking the trace, we have

−(1/β) trace(∇_Z F(MẐ)^⊤Ẑ) = ∥Û∥_C trace(U'Û^⊤) = ∥V̂∥_R trace(V'V̂^⊤).

Noting the definitions of U' and V', we have

−(1/β) trace(∇_Z F(MẐ)^⊤Ẑ) = ∥Û∥_C^2 = ∥V̂∥_R^2.  (60)

Furthermore, since clearly F(MẐ) = f(Û, V̂), we have that

−(1/β) trace(∇_Z F(MẐ)^⊤Ẑ) = ∥Û∥_C^2 = ∥V̂∥_R^2 = (1/2)(∥Û∥_C^2 + ∥V̂∥_R^2) = ∥Ẑ∥_D.  (61)

Now, we examine the optimality conditions for (18). In particular, we have

p* = min_Z F(MZ) + β∥Z∥_D,

which, by the definition of the dual norm, is equivalent to

p* = min_Z max_{Z' ∈ B_D*} F(MZ) + β trace(Z^⊤Z').

Now suppose we have an approximate saddle point (Ẑ, Ẑ') with objective value p̂ such that ∇_Z F(MẐ) + βẐ' = 0, trace(Ẑ^⊤Ẑ') = ∥Ẑ∥_D, and Ẑ' ∈ (1 + ϵ)B_D* for some ϵ ≥ 0. Let Z̄ = (1/(1 + ϵ))Ẑ'.
By strong duality and non-negativity of $F$, we have
\begin{align}
p^* &= \max_{Z' \in B_{D^*}} \min_Z F(MZ) + \beta\, \mathrm{trace}(Z^\top Z') \tag{64} \\
&\ge \min_Z F(MZ) + \beta\, \mathrm{trace}(Z^\top \bar{Z}) \tag{65} \\
&= \min_Z F(MZ) + \frac{\beta}{1+\epsilon}\, \mathrm{trace}(Z^\top \hat{Z}') \tag{66} \\
&\ge \frac{1}{1+\epsilon} \min_Z \left( F(MZ) + \beta\, \mathrm{trace}(Z^\top \hat{Z}') \right) \tag{67} \\
&= \frac{1}{1+\epsilon}\, \hat{p}. \tag{68}
\end{align}
Rearranging, we have that
\[
\frac{\hat{p} - p^*}{p^*} \le \epsilon. \tag{69}
\]
For the assumptions of this inequality to hold, any candidate solution $\hat{Z}$ must satisfy $\nabla_Z F(M\hat{Z}) + \beta \hat{Z}' = 0$ and $\mathrm{trace}(\hat{Z}^\top \hat{Z}') = \|\hat{Z}\|_D$. Solving the former equality for $\hat{Z}' = -\frac{1}{\beta} \nabla_Z F(M\hat{Z})$, we have by (61) that the latter equality is satisfied at any stationary point of the BM factorization. Lastly, for (69) to hold for a particular $\epsilon$, one must have
\[
\hat{Z}' = -\frac{1}{\beta} \nabla_Z F(M\hat{Z}) \in (1+\epsilon) B_{D^*},
\]
so $\epsilon = \left( \frac{\|\nabla_Z F(M\hat{Z})\|_D^*}{\beta} - 1 \right)_+$, i.e.
\[
\frac{\hat{p} - p^*}{p^*} \le \left( \frac{\|\nabla_Z F(M\hat{Z})\|_D^*}{\beta} - 1 \right)_+.
\]

A.6 PROOF OF THEOREM 3.5

Proof. From the stationary point condition, we have
\begin{align}
0 &= \nabla_Z F(M\hat{Z}) \hat{V} + \beta \hat{U} \tag{72} \\
0 &= \nabla_Z F(M\hat{Z})^\top \hat{U} + \beta \hat{V}. \tag{73}
\end{align}
From (73) we can obtain
\[
\hat{U}^\top \nabla_Z F(M\hat{Z}) = -\beta \hat{V}^\top. \tag{74}
\]
Left-multiplying (72) by $\hat{U}^\top$ and substituting in (74), we have
\[
0 = -\beta \hat{V}^\top \hat{V} + \beta \hat{U}^\top \hat{U} \tag{75}
\]
\[
\hat{V}^\top \hat{V} = \hat{U}^\top \hat{U}. \tag{76}
\]
Thus, let $r := \mathrm{rank}(\hat{V}) = \mathrm{rank}(\hat{U}) \le \min\{d_c, d_r\}$. We can write the compact SVDs of the stationary point as $\hat{U} = L_U \Lambda R^\top$ and $\hat{V} = L_V \Lambda R^\top$, where $L_U \in \mathbb{R}^{d_c \times r}$, $L_V \in \mathbb{R}^{d_r \times r}$. Assume without loss of generality that $d_c > d_r$, so $d_r = \min\{d_c, d_r\}$. We have
\begin{align}
0 &= \nabla_Z F(M\hat{Z}) \hat{V} + \beta \hat{U} \tag{77} \\
0 &= \nabla_Z F(M\hat{Z}) L_V \Lambda R^\top + \beta L_U \Lambda R^\top \tag{78} \\
-\beta L_U &= \nabla_Z F(M\hat{Z}) L_V. \tag{79}
\end{align}
If $r = d_r$, $L_V$ is square and therefore unitary, so
\begin{align}
\|\nabla_Z F(M\hat{Z})\|_2 &= \|\nabla_Z F(M\hat{Z}) L_V\|_2 \tag{80} \\
&= \|-\beta L_U\|_2 \tag{81} \\
&= \beta. \tag{82}
\end{align}
Thus, we satisfy (22) with equality. In general, note that when $r < \min\{d_c, d_r\}$, we have
\begin{align}
\|\nabla_Z F(M\hat{Z})\|_2 &= \|\nabla_Z F(M\hat{Z})\|_2 \|L_V\|_2 \ge \|\nabla_Z F(M\hat{Z}) L_V\|_2 \tag{83} \\
&= \|-\beta L_U\|_2 \tag{84} \\
&= \beta, \tag{85}
\end{align}
so (22) is a lower bound, which depends on how $\nabla_Z F(M\hat{Z})$ behaves when operating on vectors in $\mathrm{null}(L_V^\top)$.

A.7 PROOF OF THEOREM 3.6

Proof.
We begin with the non-convex objective (25):
\[
p^*_{CNN} := \min_{\substack{w_{1j} \in \mathbb{R}^h \\ W_{2j} \in \mathbb{R}^{c \times a}}} \sum_{i=1}^n F\left( \sum_{j=1}^m W_{2j} P_a (X_i w_{1j})_+ \right) + \frac{\beta}{2} \sum_{j=1}^m \left( \|w_{1j}\|_2^2 + \|W_{2j}\|_F^2 \right).
\]
We can re-write this as (Bach et al., 2008; Pilanci & Ergen, 2020)
\[
p^*_{CNN} = \min_{\substack{w_{1j} \in B_2 \\ W_{2j} \in \mathbb{R}^{c \times a}}} \sum_{i=1}^n F\left( \sum_{j=1}^m W_{2j} P_a (X_i w_{1j})_+ \right) + \beta \sum_{j=1}^m \|W_{2j}\|_F.
\]
We can also re-write this in constrained form:
\[
p^*_{CNN} = \min_{\substack{w_{1j} \in B_2,\ W_{2j},\ r_i}} \sum_{i=1}^n F(r_i) + \beta \sum_{j=1}^m \|W_{2j}\|_F \quad \text{s.t. } \sum_{j=1}^m W_{2j} P_a (X_i w_{1j})_+ = r_i. \tag{88}
\]
Forming the Lagrangian, we have
\[
p^*_{CNN} = \min_{\substack{w_{1j} \in B_2,\ W_{2j},\ r_i}} \max_{v_i} \sum_{i=1}^n F(r_i) + \beta \sum_{j=1}^m \|W_{2j}\|_F + \sum_{i=1}^n v_i^\top \left( \sum_{j=1}^m W_{2j} P_a (X_i w_{1j})_+ - r_i \right). \tag{89}
\]
By Sion's minimax theorem, we can swap the minimum over $W_{2j}, r_i$ and the maximum over $v_i$, and minimize over $W_{2j}, r_i$ to obtain
\[
p^*_{CNN} = \min_{u \in B_2} \max_{v_i} -\sum_{i=1}^n F^*(v_i) \quad \text{s.t. } \left\| \sum_{i=1}^n P_a (X_i u)_+ v_i^\top \right\|_F \le \beta, \tag{90}
\]
where $F^*$ is the Fenchel conjugate of $F$. Now, as long as $\beta > 0$ and $m \ge m^*$ where $m^* \le nac$, we can switch the order of max and min by Slater's condition (Shapiro, 2009; Sahiner et al., 2021b) to obtain
\[
p^*_{CNN} = \max_{v_i} -\sum_{i=1}^n F^*(v_i) \quad \text{s.t. } \max_{u \in B_2} \left\| \sum_{i=1}^n P_a (X_i u)_+ v_i^\top \right\|_F \le \beta. \tag{91}
\]
Enumerating over the hyperplane arrangements $\{D_k\}_{k=1}^P$, we can further write this as
\[
p^*_{CNN} = \max_{v_i} -\sum_{i=1}^n F^*(v_i) \quad \text{s.t. } \max_{\substack{k \in [P],\ u \in B_2 \\ (2D_k^{(i)} - I)X_i u \ge 0}} \left\| \sum_{i=1}^n P_a D_k^{(i)} X_i u v_i^\top \right\|_F \le \beta. \tag{92}
\]
Now, noting that $\mathrm{vec}(ABC) = (C^\top \otimes A)\mathrm{vec}(B)$ (Magnus & Neudecker, 2019), this is equivalent to
\[
p^*_{CNN} = \max_{v_i} -\sum_{i=1}^n F^*(v_i) \quad \text{s.t. } \max_{\substack{k \in [P],\ u \in B_2 \\ (2D_k^{(i)} - I)X_i u \ge 0}} \left\| \sum_{i=1}^n \left(v_i \otimes P_a D_k^{(i)} X_i\right) u \right\|_2 \le \beta. \tag{93}
\]
This may also be written further as
\[
p^*_{CNN} = \max_{v_i} -\sum_{i=1}^n F^*(v_i) \quad \text{s.t. } \max_{\substack{k \in [P],\ u \in B_2,\ g \in B_2 \\ (2D_k^{(i)} - I)X_i u \ge 0}} g^\top \sum_{i=1}^n \left(v_i \otimes P_a D_k^{(i)} X_i\right) u \le \beta,
\]
and thereby as
\[
p^*_{CNN} = \max_{v_i} -\sum_{i=1}^n F^*(v_i) \quad \text{s.t. } \max_{\substack{k \in [P],\ u \in B_2,\ g \in B_2 \\ (2D_k^{(i)} - I)X_i u \ge 0}} \mathrm{trace}\left( \sum_{i=1}^n \left(v_i \otimes P_a D_k^{(i)} X_i\right) u g^\top \right) \le \beta.
\]
Now, we let $Z = ug^\top$ to obtain
\[
p^*_{CNN} = \max_{v_i} -\sum_{i=1}^n F^*(v_i) \quad \text{s.t.}
\]
\[
\max_{\substack{k \in [P],\ Z = ug^\top \\ u \in B_2,\ g \in B_2 \\ (2D_k^{(i)} - I)X_i u \ge 0}} \mathrm{trace}\left( \sum_{i=1}^n \left(v_i \otimes P_a D_k^{(i)} X_i\right) Z \right) \le \beta. \tag{96}
\]
We let $\mathcal{C}_k := \mathrm{conv}\left\{ ug^\top : (2D_k^{(i)} - I)X_i u \ge 0,\ u \in B_2,\ g \in B_2 \right\}$ and note that, since our objective is linear, we can take the convex hull of the constraint set without changing the objective, to obtain
\[
p^*_{CNN} = \max_{v_i} -\sum_{i=1}^n F^*(v_i) \quad \text{s.t. } \max_{k \in [P],\ Z \in \mathcal{C}_k} \mathrm{trace}\left( \sum_{i=1}^n \left(v_i \otimes P_a D_k^{(i)} X_i\right) Z \right) \le \beta. \tag{97}
\]
Note that the constraint $Z \in \mathcal{C}_k$ is equivalent to stating that $\|Z\|_{*,K_k} \le 1$ for the constrained nuclear norm with $K_k = (2D_k - I)X$. Then, we have
\[
p^*_{CNN} = \max_{v_i} -\sum_{i=1}^n F^*(v_i) \quad \text{s.t. } \max_{k \in [P],\ \|Z\|_{*,K_k} \le 1} \mathrm{trace}\left( \sum_{i=1}^n \left(v_i \otimes P_a D_k^{(i)} X_i\right) Z \right) \le \beta. \tag{98}
\]
Now, we form the Lagrangian, given by
\[
p^*_{CNN} = \max_{v_i} \min_{\substack{\|Z_k\|_{*,K_k} \le 1 \\ \lambda_k \ge 0}} -\sum_{i=1}^n F^*(v_i) + \sum_{k=1}^P \lambda_k \left( \beta - \sum_{i=1}^n \mathrm{vec}(Z_k)^\top \mathrm{vec}\left( v_i^\top \otimes (P_a D_k^{(i)} X_i)^\top \right) \right). \tag{99}
\]
By Sion's minimax theorem, we are permitted to change the order of the maxima and minima, to obtain
\[
p^*_{CNN} = \min_{\substack{\lambda_k \ge 0 \\ \|Z_k\|_{*,K_k} \le 1}} \max_{v_i} -\sum_{i=1}^n F^*(v_i) + \sum_{k=1}^P \lambda_k \left( \beta - \sum_{i=1}^n \mathrm{vec}(Z_k)^\top \mathrm{vec}\left( v_i^\top \otimes (P_a D_k^{(i)} X_i)^\top \right) \right). \tag{100}
\]
Now, defining $K_{a,1}$ as the $(a,1)$ commutation matrix, we have the following identity from (Magnus & Neudecker, 2019):
\[
\mathrm{vec}\left( v_i^\top \otimes (P_a D_k^{(i)} X_i)^\top \right) = \left( I_c \otimes \left[ (K_{a,1} \otimes I_h)\, \mathrm{vec}\!\left(X_i^\top D_k^{(i)} P_a^\top\right) \right] \right) v_i.
\]
Using this identity and maximizing over $v_i$, we obtain
\[
p^*_{CNN} = \min_{\substack{\|Z_k\|_{*,K_k} \le 1 \\ \lambda_k \ge 0}} \sum_{i=1}^n F\left( \sum_{k=1}^P \left( I_c \otimes \mathrm{vec}\!\left(X_i^\top D_k^{(i)} P_a^\top\right)^\top (K_{1,a} \otimes I_h) \right) \lambda_k\, \mathrm{vec}(Z_k) \right) + \beta \sum_{k=1}^P \lambda_k. \tag{101}
\]
Rescaling such that $\bar{Z}_k = \lambda_k Z_k$, we obtain
\[
p^*_{CNN} = \min_{\bar{Z}_k \in \mathbb{R}^{h \times ac}} \sum_{i=1}^n F\left( \sum_{k=1}^P \left( I_c \otimes \mathrm{vec}\!\left(X_i^\top D_k^{(i)} P_a^\top\right)^\top (K_{1,a} \otimes I_h) \right) \mathrm{vec}(\bar{Z}_k) \right) + \beta \sum_{k=1}^P \|\bar{Z}_k\|_{*,K_k}. \tag{102}
\]
Simplifying further, we can write this as
\[
p^*_{CNN} = \min_{Z_k \in \mathbb{R}^{h \times ac}} \sum_{i=1}^n F\left( \sum_{k=1}^P \begin{bmatrix} \mathrm{trace}(P_a D_k^{(i)} X_i Z_k^{(1)}) \\ \vdots \\ \mathrm{trace}(P_a D_k^{(i)} X_i Z_k^{(c)}) \end{bmatrix} \right) + \beta \sum_{k=1}^P \|Z_k\|_{*,K_k},
\]
where $Z_k^{(c')} \in \mathbb{R}^{h \times a}$.

A.8 PROOF OF LEMMA 3.7

Proof.
We start with the convex formulation
\[
p^*_{RCNN} = \min_{Z_k \in \mathbb{R}^{h \times ac}} \sum_{i=1}^n F\left( \sum_{k=1}^P \begin{bmatrix} \mathrm{trace}(P_a D_k^{(i)} X_i Z_k^{(1)}) \\ \vdots \\ \mathrm{trace}(P_a D_k^{(i)} X_i Z_k^{(c)}) \end{bmatrix} \right) + \beta \sum_{k=1}^P \|Z_k\|_{*,K_k}.
\]
In order to compute the Burer-Monteiro factorization, we factor $Z_k = U_k V_k^\top$, where $U_k \in \mathbb{R}^{h \times m}$, $V_k \in \mathbb{R}^{ac \times m}$, and $(2D_k^{(i)} - I)X_i U_k \ge 0$, with blocks $V_k^{(c')} \in \mathbb{R}^{a \times m}$. Then, for each $k$,
\[
\begin{bmatrix} \mathrm{trace}(P_a D_k^{(i)} X_i Z_k^{(1)}) \\ \vdots \\ \mathrm{trace}(P_a D_k^{(i)} X_i Z_k^{(c)}) \end{bmatrix}
= \begin{bmatrix} \mathrm{trace}(P_a D_k^{(i)} X_i U_k V_k^{(1)\top}) \\ \vdots \\ \mathrm{trace}(P_a D_k^{(i)} X_i U_k V_k^{(c)\top}) \end{bmatrix}
= \begin{bmatrix} \mathrm{trace}(V_k^{(1)\top} P_a D_k^{(i)} X_i U_k) \\ \vdots \\ \mathrm{trace}(V_k^{(c)\top} P_a D_k^{(i)} X_i U_k) \end{bmatrix}
= \begin{bmatrix} \sum_{j=1}^m v_{jk}^{(1)\top} P_a D_k^{(i)} X_i u_{jk} \\ \vdots \\ \sum_{j=1}^m v_{jk}^{(c)\top} P_a D_k^{(i)} X_i u_{jk} \end{bmatrix}
= \sum_{j=1}^m V_{jk} P_a D_k^{(i)} X_i u_{jk},
\]
where $V_{jk} := \begin{bmatrix} v_{jk}^{(1)\top} \\ \vdots \\ v_{jk}^{(c)\top} \end{bmatrix} \in \mathbb{R}^{c \times a}$. The equivalent Burer-Monteiro formulation is thus given by
\[
p^*_{RCNN} = \min_{\substack{\{\{u_{jk} \in \mathbb{R}^h\}_{j=1}^m\}_{k=1}^P \\ \{\{V_{jk} \in \mathbb{R}^{c \times a}\}_{j=1}^m\}_{k=1}^P \\ (2D_k^{(i)} - I)X_i u_{jk} \ge 0}} \sum_{i=1}^n F\left( \sum_{k=1}^P \sum_{j=1}^m V_{jk} P_a D_k^{(i)} X_i u_{jk} \right) + \frac{\beta}{2} \sum_{k=1}^P \sum_{j=1}^m \left( \|u_{jk}\|_2^2 + \|V_{jk}\|_F^2 \right).
\]

A.9 PROOF OF COROLLARY 3.7.1

Proof. We simply apply the result of Theorem 3.4, noting that stationary points correspond to global minima if the dual norm of the gradient is at most $\beta$. This condition is equivalent to
\[
\left\| \sum_{i=1}^n \nabla_{Z_k} F\left( \sum_{k'=1}^P \begin{bmatrix} \mathrm{trace}(P_a D_{k'}^{(i)} X_i Z_{k'}^{(1)}) \\ \vdots \\ \mathrm{trace}(P_a D_{k'}^{(i)} X_i Z_{k'}^{(c)}) \end{bmatrix} \right) u \right\|_2 \le \beta, \quad \forall k \in [P],\ \forall u \in B_2 : (2D_k^{(i)} - I)X_i u \ge 0.
\]

A.10 PROOF OF LEMMA 3.8

Proof. We begin from the convex formulation (10):
\[
p^*_{LSA} = \min_{Z \in \mathbb{R}^{d^2 \times dc}} \sum_{i=1}^n F\left( \sum_{k=1}^d \sum_{\ell=1}^d G_i[k, \ell]\, X_i Z^{(k,\ell)} \right) + \beta \|Z\|_*.
\]
Now, we seek its Burer-Monteiro factorization. We let $Z = UV^\top$, where $U \in \mathbb{R}^{d^2 \times m}$ and $V \in \mathbb{R}^{dc \times m}$. Let $\mathrm{vec}^{-1}(u_j) \in \mathbb{R}^{d \times d}$ be the result of taking chunks of $d$-length vectors from $u_j$ for $j \in [m]$ and stacking them in columns.
Similarly, let $\mathrm{vec}^{-1}(v_j) \in \mathbb{R}^{c \times d}$ be the result of taking chunks of $c$-length vectors from $v_j$ and stacking them in columns. Furthermore, we will let $\mathrm{vec}^{-1}(u_j)_k$ be the $k$th column of $\mathrm{vec}^{-1}(u_j)$. Then, recognize that
\[
Z^{(k,\ell)} = \sum_{j=1}^m \mathrm{vec}^{-1}(u_j)_k\, \mathrm{vec}^{-1}(v_j)_\ell^\top.
\]
Thus,
\begin{align*}
\sum_{k=1}^d \sum_{\ell=1}^d G_i[k,\ell]\, X_i Z^{(k,\ell)}
&= \sum_{k=1}^d \sum_{\ell=1}^d \sum_{j=1}^m G_i[k,\ell]\, X_i\, \mathrm{vec}^{-1}(u_j)_k\, \mathrm{vec}^{-1}(v_j)_\ell^\top \\
&= X_i \sum_{j=1}^m \sum_{k=1}^d \sum_{\ell=1}^d \mathrm{vec}^{-1}(u_j)_k\, G_i[k,\ell]\, \mathrm{vec}^{-1}(v_j)_\ell^\top \\
&= X_i \sum_{j=1}^m \mathrm{vec}^{-1}(u_j)\, G_i\, \mathrm{vec}^{-1}(v_j)^\top \\
&= \sum_{j=1}^m X_i\, \mathrm{vec}^{-1}(u_j)\, X_i^\top X_i\, \mathrm{vec}^{-1}(v_j)^\top.
\end{align*}
Now, overloading notation, let $\mathrm{vec}^{-1}(u_j) = U_j$ and $\mathrm{vec}^{-1}(v_j)^\top = V_j$. We have clearly that the Burer-Monteiro factorization of (10) is given by
\[
p^*_{LSA} = \min_{\substack{U_j \in \mathbb{R}^{d \times d} \\ V_j \in \mathbb{R}^{d \times c}}} \sum_{i=1}^n F\left( \sum_{j=1}^m X_i U_j X_i^\top X_i V_j \right) + \frac{\beta}{2} \sum_{j=1}^m \left( \|U_j\|_F^2 + \|V_j\|_F^2 \right).
\]

A.11 PROOF OF COROLLARY 3.8.1

Proof. We simply need to apply the result of Theorem 3.3 to this setting. In this case, the non-convex linear self-attention network is equivalent to the Burer-Monteiro factorization of the convex form. To obtain this Burer-Monteiro factorization, we factorize convex weights $Z \in \mathbb{R}^{d^2 \times dc}$, so $\mathrm{rank}(Z^*) \le \min\{d^2, dc\}$. Thus, letting $m^* = \mathrm{rank}(Z^*) \le \min\{d^2, dc\}$, we observe that as long as the number of heads $m$ exceeds $m^*$, from Lemma A.7, all local optima are global. Further, we can form $\hat{R} = \begin{bmatrix} \hat{U} \\ \hat{V} \end{bmatrix}$, and as long as this is a rank-deficient local minimum, it also corresponds to a global minimum when $F$ is twice-differentiable, by Lemma A.9.
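As a numerical illustration of the machinery in A.5 and A.6, consider $F(Z) = \frac{1}{2}\|Z - Y\|_F^2$ with $M = I$ and the plain nuclear norm, whose dual norm is the spectral norm. The convex problem is solved in closed form by singular-value thresholding, and at a gradient-descent stationary point of the BM factorization the certificate $\epsilon = (\|\nabla_Z F(\hat{Z})\|_2 / \beta - 1)_+$ bounds the relative optimality gap. This is only a sketch under our own choices of problem size, step size, and iteration count:

```python
import numpy as np

rng = np.random.default_rng(0)
d_r, d_c, m, beta = 6, 5, 5, 0.5
Y = rng.standard_normal((d_r, d_c))

# Closed-form convex solution: singular-value thresholding at level beta.
Uy, s, Vty = np.linalg.svd(Y, full_matrices=False)
s_thr = np.maximum(s - beta, 0.0)
Z_star = (Uy * s_thr) @ Vty
p_star = 0.5 * np.linalg.norm(Z_star - Y) ** 2 + beta * s_thr.sum()

# Burer-Monteiro factorization Z = U V^T, optimized by plain gradient descent.
U = 0.1 * rng.standard_normal((d_r, m))
V = 0.1 * rng.standard_normal((d_c, m))
lr = 0.05
for _ in range(20000):
    R = U @ V.T - Y  # residual, i.e. the gradient of F at Z = U V^T
    U, V = U - lr * (R @ V + beta * U), V - lr * (R.T @ U + beta * V)

Z_hat = U @ V.T
p_hat = 0.5 * np.linalg.norm(Z_hat - Y) ** 2 \
    + 0.5 * beta * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2)

# Certificate from the stationarity analysis: eps = (||grad||_2 / beta - 1)_+.
eps = max(np.linalg.norm(Z_hat - Y, 2) / beta - 1.0, 0.0)
assert (p_hat - p_star) / p_star <= eps + 1e-2
```

Since $m \ge \mathrm{rank}(Z^*)$ here, GD reaches the global optimum and the certificate is (numerically) zero, consistent with Theorem 3.4.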

B.1 MLPS

The following theorem demonstrates that we can extend the results of (Sahiner et al., 2021b) beyond simple weight-decay regularization, to arbitrary norms.

Theorem B.1. The non-convex ReLU training objective
\[
p^* := \min_{w_{1j}, w_{2j}} F\left( \sum_{j=1}^m (X w_{1j})_+ w_{2j}^\top \right) + \frac{\beta}{2} \sum_{j=1}^m \left( \|w_{1j}\|_C^2 + \|w_{2j}\|_R^2 \right)
\]
is equivalent to the convex training objective
\[
p^* = \min_{Z_j} F\left( \sum_{j=1}^P D_j X Z_j \right) + \beta \sum_{j=1}^P \|Z_j\|_{D_j},
\]
as long as $\beta > 0$ and $m \ge m^*$ where $m^* \le nc$, where
\[
\|Z\|_{D_j} := \max_R \mathrm{trace}(R^\top Z) \quad \text{s.t. } u^\top R v \le 1 \ \forall u \in B_C : (2D_j - I_n)Xu \ge 0,\ \forall v \in B_R.
\]

Proof. We start by re-stating the objective as (Bach et al., 2008; Pilanci & Ergen, 2020)
\[
p^* = \min_{w_{1j} \in B_C,\ w_{2j}} F\left( \sum_{j=1}^m (X w_{1j})_+ w_{2j}^\top \right) + \beta \sum_{j=1}^m \|w_{2j}\|_R.
\]
Then, we can re-write this in constrained form,
\[
p^* = \min_{w_{1j} \in B_C,\ w_{2j},\ R} F(R) + \beta \sum_{j=1}^m \|w_{2j}\|_R \quad \text{s.t. } \sum_{j=1}^m (X w_{1j})_+ w_{2j}^\top = R,
\]
and then form the Lagrangian:
\[
p^* = \min_{w_{1j} \in B_C,\ w_{2j},\ R} \max_V F(R) + \beta \sum_{j=1}^m \|w_{2j}\|_R + \mathrm{trace}\left( V^\top \left( \sum_{j=1}^m (X w_{1j})_+ w_{2j}^\top - R \right) \right).
\]
By Sion's minimax theorem, we can swap the order of the maximization over $V$ and the minimization over $w_{2j}$ and $R$. Minimizing over these two, we have
\[
p^* = \min_{u \in B_C} \max_V -F^*(V) \quad \text{s.t. } \|V^\top (Xu)_+\|_R^* \le \beta.
\]
By Slater's condition, which holds when $\beta > 0$ and $m \ge m^*$ where $m^* \le nc$ (Shapiro, 2009; Sahiner et al., 2021c), we can switch the order of minimum and maximum to obtain
\[
p^* = \max_V -F^*(V) \quad \text{s.t. } \max_{u \in B_C} \|V^\top (Xu)_+\|_R^* \le \beta.
\]
Introducing hyperplane arrangements, we have
\[
p^* = \max_V -F^*(V) \quad \text{s.t. } \max_{\substack{j \in [P],\ u \in B_C \\ (2D_j - I)Xu \ge 0}} \|V^\top D_j X u\|_R^* \le \beta.
\]
By the definition of the dual norm, this is equivalent to
\[
p^* = \max_V -F^*(V) \quad \text{s.t. } \max_{\substack{j \in [P],\ u \in B_C,\ g \in B_R \\ (2D_j - I)Xu \ge 0}} \mathrm{trace}\left( V^\top D_j X u g^\top \right) \le \beta.
\]
Define
\[
\|Z\|_j := \min_{t \ge 0} t \quad \text{s.t. } Z \in t \cdot \mathrm{conv}\left\{ ug^\top : u \in B_C,\ (2D_j - I)Xu \ge 0,\ g \in B_R \right\}. \tag{112}
\]
Then, we can write our problem as
\[
p^* = \max_V -F^*(V) \quad \text{s.t. } \max_{j \in [P],\ \|Z\|_j \le 1} \mathrm{trace}\left( V^\top D_j X Z \right) \le \beta.
\]
Now, observe that $\|Z\|_j = \|Z\|_{D_j}$. In particular, let us examine (112), which we can re-write as
\begin{align}
\|Z\|_{D_j} &= \max_R \mathrm{trace}(R^\top Z) \quad \text{s.t. } \max_{\|Z'\|_j \le 1} \mathrm{trace}(Z'^\top R) \le 1 \tag{122} \\
&= \max_R \mathrm{trace}(R^\top Z) \quad \text{s.t.}
\end{align}
\begin{align}
&\phantom{= \max_R \mathrm{trace}(R^\top Z) \quad \text{s.t. }} \|R\|_j^* \le 1 \tag{123} \\
&= \|Z\|_j, \tag{124}
\end{align}
where the simplifications follow from the definition of the dual norm. Now, we can write our objective as
\[
p^* = \max_V -F^*(V) \quad \text{s.t. } \max_{j \in [P],\ \|Z\|_{D_j} \le 1} \mathrm{trace}\left( V^\top D_j X Z \right) \le \beta.
\]
Forming the Lagrangian, we have
\[
p^* = \max_V \min_{\substack{\|Z_j\|_{D_j} \le 1 \\ \lambda_j \ge 0}} -F^*(V) + \sum_{j=1}^P \lambda_j \left( \beta - \mathrm{trace}\left( V^\top D_j X Z_j \right) \right).
\]
By Sion's minimax theorem, we can switch the max and min and solve over $V$ to obtain
\[
p^* = \min_{\substack{\|Z_j\|_{D_j} \le 1 \\ \lambda_j \ge 0}} F\left( \sum_{j=1}^P \lambda_j D_j X Z_j \right) + \beta \sum_{j=1}^P \lambda_j.
\]
Lastly, we can combine $\lambda_j Z_j$ into one variable to obtain
\[
p^* = \min_{Z_j} F\left( \sum_{j=1}^P D_j X Z_j \right) + \beta \sum_{j=1}^P \|Z_j\|_{D_j},
\]
as desired.
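The hyperplane arrangements $\{D_j\}_{j=1}^P$ used throughout these proofs can be enumerated approximately by sampling gate vectors $u$ and recording the sign patterns of $Xu$; the number of distinct patterns is bounded by $2\sum_{k=0}^{r-1}\binom{n-1}{k}$ for $r = \mathrm{rank}(X)$. A sampling sketch (which may miss rare patterns, so it yields a lower estimate of $P$):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(2)
n, d = 6, 2
X = rng.standard_normal((n, d))

patterns = set()
for _ in range(5000):  # sample gates u, record the induced activation pattern
    u = rng.standard_normal(d)
    patterns.add(tuple((X @ u >= 0).astype(int).tolist()))

P = len(patterns)  # each pattern gives an arrangement D_j = diag(1{Xu >= 0})
r = np.linalg.matrix_rank(X)
bound = 2 * sum(comb(n - 1, k) for k in range(r))
assert P <= bound  # classical bound on the number of arrangements
```

Each recovered pattern corresponds to one diagonal matrix $D_j$, and the convex programs above have one block of variables per pattern.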

B.2 CNNS

Lemma B.2. The Burer-Monteiro factorizations of the convex CNN problem with linear and gated ReLU activations are given as follows:
\[
p^*_{LCNN} = \min_{\substack{\{u_j \in \mathbb{R}^h\}_{j=1}^m \\ \{V_j \in \mathbb{R}^{c \times a}\}_{j=1}^m}} \sum_{i=1}^n F\left( \sum_{j=1}^m V_j P_a X_i u_j \right) + \frac{\beta}{2} \sum_{j=1}^m \left( \|u_j\|_2^2 + \|V_j\|_F^2 \right) \tag{129}
\]
\[
p^*_{GCNN} = \min_{\substack{\{\{u_{jk} \in \mathbb{R}^h\}_{j=1}^m\}_{k=1}^P \\ \{\{V_{jk} \in \mathbb{R}^{c \times a}\}_{j=1}^m\}_{k=1}^P}} \sum_{i=1}^n F\left( \sum_{k=1}^P \sum_{j=1}^m V_{jk} P_a D_k^{(i)} X_i u_{jk} \right) + \frac{\beta}{2} \sum_{k=1}^P \sum_{j=1}^m \left( \|u_{jk}\|_2^2 + \|V_{jk}\|_F^2 \right). \tag{130}
\]

Proof. The proofs follow almost identically from the proof of Lemma 3.7. In the linear case, the convex objective is given by
\[
p^*_{LCNN} = \min_{Z \in \mathbb{R}^{h \times ac}} \sum_{i=1}^n F\left( \begin{bmatrix} \mathrm{trace}(P_a X_i Z^{(1)}) \\ \vdots \\ \mathrm{trace}(P_a X_i Z^{(c)}) \end{bmatrix} \right) + \beta \|Z\|_*.
\]
In order to compute the Burer-Monteiro factorization, we factor $Z = UV^\top$, where $U \in \mathbb{R}^{h \times m}$, $V \in \mathbb{R}^{ac \times m}$, with blocks $V^{(c')} \in \mathbb{R}^{a \times m}$. Then,
\[
\begin{bmatrix} \mathrm{trace}(P_a X_i Z^{(1)}) \\ \vdots \\ \mathrm{trace}(P_a X_i Z^{(c)}) \end{bmatrix}
= \begin{bmatrix} \mathrm{trace}(P_a X_i U V^{(1)\top}) \\ \vdots \\ \mathrm{trace}(P_a X_i U V^{(c)\top}) \end{bmatrix}
= \begin{bmatrix} \mathrm{trace}(V^{(1)\top} P_a X_i U) \\ \vdots \\ \mathrm{trace}(V^{(c)\top} P_a X_i U) \end{bmatrix}
= \begin{bmatrix} \sum_{j=1}^m v_j^{(1)\top} P_a X_i u_j \\ \vdots \\ \sum_{j=1}^m v_j^{(c)\top} P_a X_i u_j \end{bmatrix}
= \sum_{j=1}^m V_j P_a X_i u_j,
\]
where $V_j := \begin{bmatrix} v_j^{(1)\top} \\ \vdots \\ v_j^{(c)\top} \end{bmatrix} \in \mathbb{R}^{c \times a}$. The equivalent Burer-Monteiro formulation is thus given by (129). In the gated ReLU case, the convex program is given by
\[
p^*_{GCNN} = \min_{Z_k \in \mathbb{R}^{h \times ac}} \sum_{i=1}^n F\left( \sum_{k=1}^P \begin{bmatrix} \mathrm{trace}(P_a D_k^{(i)} X_i Z_k^{(1)}) \\ \vdots \\ \mathrm{trace}(P_a D_k^{(i)} X_i Z_k^{(c)}) \end{bmatrix} \right) + \beta \sum_{k=1}^P \|Z_k\|_*.
\]
In order to compute the Burer-Monteiro factorization, we factor $Z_k = U_k V_k^\top$, where $U_k \in \mathbb{R}^{h \times m}$, $V_k \in \mathbb{R}^{ac \times m}$, with blocks $V_k^{(c')} \in \mathbb{R}^{a \times m}$. Then, for each $k$,
\[
\begin{bmatrix} \mathrm{trace}(P_a D_k^{(i)} X_i Z_k^{(1)}) \\ \vdots \\ \mathrm{trace}(P_a D_k^{(i)} X_i Z_k^{(c)}) \end{bmatrix}
= \begin{bmatrix} \mathrm{trace}(P_a D_k^{(i)} X_i U_k V_k^{(1)\top}) \\ \vdots \\ \mathrm{trace}(P_a D_k^{(i)} X_i U_k V_k^{(c)\top}) \end{bmatrix}
= \begin{bmatrix} \mathrm{trace}(V_k^{(1)\top} P_a D_k^{(i)} X_i U_k) \\ \vdots \\ \mathrm{trace}(V_k^{(c)\top} P_a D_k^{(i)} X_i U_k) \end{bmatrix}
= \begin{bmatrix} \sum_{j=1}^m v_{jk}^{(1)\top} P_a D_k^{(i)} X_i u_{jk} \\ \vdots \\ \sum_{j=1}^m v_{jk}^{(c)\top} P_a D_k^{(i)} X_i u_{jk} \end{bmatrix}
= \sum_{j=1}^m V_{jk} P_a D_k^{(i)} X_i u_{jk},
\]
where $V_{jk} := \begin{bmatrix} v_{jk}^{(1)\top} \\ \vdots \\ v_{jk}^{(c)\top} \end{bmatrix} \in \mathbb{R}^{c \times a}$. The equivalent Burer-Monteiro formulation is thus given by (130).

B.3 SELF-ATTENTION

Lemma B.3. The Burer-Monteiro factorizations of the convex self-attention problem with gated ReLU and ReLU activations are given as follows:
\[
p^*_{GSA} = \min_{U_{jk}, V_{jk}} \sum_{i=1}^n F\left( \sum_{k=1}^P \sum_{j=1}^m \left( \mathrm{diag}^{-1}(D_k^{(i)}) \odot (X_i U_{jk} X_i^\top) \right) X_i V_{jk} \right) + \frac{\beta}{2} \sum_{k=1}^P \sum_{j=1}^m \left( \|U_{jk}\|_F^2 + \|V_{jk}\|_F^2 \right) \tag{137}
\]
\[
p^*_{RSA} = \min_{U_{jk}, V_{jk}} \sum_{i=1}^n F\left( \sum_{k=1}^P \sum_{j=1}^m \left( \mathrm{diag}^{-1}(D_k^{(i)}) \odot (X_i U_{jk} X_i^\top) \right) X_i V_{jk} \right) + \frac{\beta}{2} \sum_{k=1}^P \sum_{j=1}^m \left( \|U_{jk}\|_F^2 + \|V_{jk}\|_F^2 \right) \tag{138}
\]
\[
\text{s.t. } \left( 2\,\mathrm{diag}^{-1}(D_k^{(i)}) - \mathbf{1}\mathbf{1}^\top \right) \odot (X_i U_{jk} X_i^\top) \ge 0,
\]
where $\mathrm{diag}^{-1}(D_k^{(i)}) \in \mathbb{R}^{s \times s}$ takes the elements along the diagonal of $D_k^{(i)}$ and places them in matrix form.

Proof. Courtesy of (Sahiner et al., 2022), we first present the equivalent convex models for gated ReLU and ReLU activation self-attention. First, define
\[
\mathbf{X} := \begin{bmatrix} X_1 \otimes X_1 \\ \vdots \\ X_n \otimes X_n \end{bmatrix}, \qquad \{D_j\}_{j=1}^P := \left\{ \mathrm{diag}\left( \mathbb{1}\{\mathbf{X}u \ge 0\} \right) \right\}_{j=1}^P.
\]
Then, we have
\[
p^*_{GSA} = \min_{Z_j \in \mathbb{R}^{d^2 \times dc}} \sum_{i=1}^n L\left( \sum_{j=1}^P \sum_{k=1}^d \sum_{\ell=1}^d G_{i,j}^{(k,\ell)} X_i Z_j^{(k,\ell)},\ Y_i \right) + \beta \sum_{j=1}^P \|Z_j\|_*,
\]
\[
p^*_{RSA} = \min_{Z_j \in \mathbb{R}^{d^2 \times dc}} \sum_{i=1}^n L\left( \sum_{j=1}^P \sum_{k=1}^d \sum_{\ell=1}^d G_{i,j}^{(k,\ell)} X_i Z_j^{(k,\ell)},\ Y_i \right) + \beta \sum_{j=1}^P \|Z_j\|_{K_j,*},
\]
where $G_{i,j} := (X_i \otimes I_s)^\top D_j^{(i)} (X_i \otimes I_s)$, with $G_{i,j}^{(k,\ell)} \in \mathbb{R}^{s \times s}$ and $Z_j^{(k,\ell)} \in \mathbb{R}^{d \times c}$. We then proceed to take the Burer-Monteiro factorization of these models. Here, we show the Burer-Monteiro factorization of the ReLU model, noting that the proof is the same for the gated ReLU model sans the constraints. We let $Z_j = U_j V_j^\top$, where $U_j \in \mathbb{R}^{d^2 \times m}$ and $V_j \in \mathbb{R}^{dc \times m}$, with $K_j U_j \ge 0$. Let $\mathrm{vec}^{-1}(u_{jx}) \in \mathbb{R}^{d \times d}$ be the result of taking chunks of $d$-length vectors from $u_{jx}$ for $x \in [m]$ and stacking them in columns. Similarly, let $\mathrm{vec}^{-1}(v_{jx}) \in \mathbb{R}^{c \times d}$ be the result of taking chunks of $c$-length vectors from $v_{jx}$ and stacking them in columns. Furthermore, we will let $\mathrm{vec}^{-1}(u_{jx})_k$ be the $k$th column of $\mathrm{vec}^{-1}(u_{jx})$. Then, recognize that
\[
Z_j^{(k,\ell)} = \sum_{x=1}^m \mathrm{vec}^{-1}(u_{jx})_k\, \mathrm{vec}^{-1}(v_{jx})_\ell^\top.
\]
Thus, for each $j$,
\begin{align*}
\sum_{k=1}^d \sum_{\ell=1}^d G_{i,j}^{(k,\ell)} X_i Z_j^{(k,\ell)}
&= \sum_{x=1}^m \sum_{k=1}^d \sum_{\ell=1}^d G_{i,j}^{(k,\ell)} X_i\, \mathrm{vec}^{-1}(u_{jx})_k\, \mathrm{vec}^{-1}(v_{jx})_\ell^\top \\
&= \sum_{x=1}^m \sum_{k=1}^d \sum_{\ell=1}^d \left( (X_i \otimes I_s)^\top D_j^{(i)} (X_i \otimes I_s) \right)^{(k,\ell)} X_i\, \mathrm{vec}^{-1}(u_{jx})_k\, \mathrm{vec}^{-1}(v_{jx})_\ell^\top \\
&= \sum_{x=1}^m \sum_{k=1}^d \sum_{\ell=1}^d (X_i[\cdot, k] \otimes I_s)^\top D_j^{(i)} (X_i[\cdot, \ell] \otimes I_s)\, X_i\, \mathrm{vec}^{-1}(u_{jx})_k\, \mathrm{vec}^{-1}(v_{jx})_\ell^\top \\
&= \sum_{y=1}^s \sum_{x=1}^m \sum_{k=1}^d \sum_{\ell=1}^d D_j^{(i)\,(y,y)}\, X_i\, \mathrm{vec}^{-1}(u_{jx})_k\, \mathrm{vec}^{-1}(v_{jx})_\ell^\top\, X_i[y, k] X_i[y, \ell] \\
&= \sum_{x=1}^m \left( \mathrm{diag}^{-1}(D_j^{(i)}) \odot (X_i U_{jx} X_i^\top) \right) X_i V_{jx},
\end{align*}
where $U_{jx} := \mathrm{vec}^{-1}(u_{jx})$ and $V_{jx} := \mathrm{vec}^{-1}(v_{jx})^\top$, and where the constraint $(2D_j^{(i)} - I)\mathbf{X} U_j \ge 0$ can also be re-written as $\left( 2\,\mathrm{diag}^{-1}(D_j^{(i)}) - \mathbf{1}\mathbf{1}^\top \right) \odot (X_i U_{jx} X_i^\top) \ge 0$ for all $x \in [m]$. Thus, we have proven the statement.
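For the linear-activation case (Lemma 3.8, proven in A.10), the chunk-reshaping identity underlying all of these self-attention factorizations can be checked numerically. Here we take a single head ($m = 1$), use $G_i = X_i^\top X_i$ as in the last step of that derivation, and assume the $(k,\ell)$ blocks of $Z = uv^\top$ are contiguous chunks:

```python
import numpy as np

rng = np.random.default_rng(3)
s, d, c = 4, 3, 2
X = rng.standard_normal((s, d))
u = rng.standard_normal(d * d)
v = rng.standard_normal(d * c)

Umat = u.reshape(d, d, order="F")      # vec^{-1}(u): length-d chunks as columns
Vmat = v.reshape(c, d, order="F").T    # vec^{-1}(v)^T in R^{d x c}
G = X.T @ X                            # linear-activation Gram factor

# Left-hand side: sum_{k,l} G[k,l] X Z^{(k,l)} with Z^{(k,l)} = u_k v_l^T.
acc = np.zeros((s, c))
for k in range(d):
    for l in range(d):
        Zkl = np.outer(u[k * d:(k + 1) * d], v[l * c:(l + 1) * c])
        acc += G[k, l] * (X @ Zkl)

# Right-hand side: X U X^T X V, as in the proof of Lemma 3.8.
assert np.allclose(acc, X @ Umat @ X.T @ X @ Vmat)
```

The same reshaping pattern (chunks of $u$ and $v$ as columns of $U_j$ and $V_j$) is what turns the convex variable $Z$ into per-head attention weights.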

C EXPERIMENTAL DETAILS

In all cases, we solve the BM factorization with GD using PyTorch (Paszke et al., 2019) on a CPU, with a momentum parameter of 0.9 and a learning rate of 1.0 which decays by a factor of 0.9 whenever the training loss plateaus, and we train for 20000 epochs so that GD always converges. We also provide some additional experimental results not presented in the main paper in order to supplement our submission. We consider the task of leveraging the theory of two-layer convex ReLU neural networks for training deep image classifiers. In particular, following the approach of (Belilovsky et al., 2019), we seek to train two-layer convex CNNs greedily to mimic the performance of a deep network. In the non-convex setting, the greedy approach proceeds by training a single two-layer CNN, freezing its weights, using its latent representation as the input features for another two-layer CNN, and repeating this process for a specified number of stages. We leverage the result of Theorem 3.6 to convert this non-convex layerwise training procedure into a convex one, training stages of convex two-layer gated ReLU CNNs with average pooling. We apply this procedure to the image classification datasets CIFAR-10 (Krizhevsky et al., 2009) and Fashion-MNIST (Xiao et al., 2017), following all general architecture choices from (Belilovsky et al., 2019) (see below for details). We model the scenario in which we are in a memory-limited setting, restricting our memory to a single 12GB GPU. In this memory-limited setting, layerwise training with the full convex model is impossible, because the nuclear-norm penalty requires an SVD at each iteration, and further because the latent representation to be used as input for the second stage, given by $\{\{D_j^{(i)} X_i Z_j^{(c')}\}_{j=1}^P\}_{c'=1}^c$, has $Pac$ channels, which for reasonable choices of $P = 256$, $a = 4$, $c = 10$ yields upwards of $10^4$ channels for the input to the second CNN stage.
Accordingly, for this model, we employ the BM factorization with m = 1, allowing for a latent representation for the second stage consisting of only P channels. As we show in Figure 3, this BM scheme for layerwise training allows for both test and train accuracies to improve from one stage to the next of the layerwise training procedure, reaching the performance of much deeper networks while enabling a convex optimization procedure. In particular, training five stages of a BM-factorized convex two-layer gated ReLU CNN on CIFAR-10 resulted in a final test accuracy of 80.5%. Previously, it has been demonstrated that a six-layer ReLU CNN achieves 81.6% on CIFAR-10 when trained end-to-end (Kiliçarslan & Celik, 2021). In addition, the three-stage trained BM-factorized convex two-layer gated ReLU CNN on Fashion-MNIST achieved a final test accuracy of 91.5%, compared to 91.2% for a four-layer ReLU CNN trained end-to-end (Bhatnagar et al., 2017). These results demonstrate that the BM factorization is essential for convex neural networks to match the performance of deep end-to-end trained ReLU networks. Our layerwise training procedure was run on a single NVIDIA 1080 Ti GPU using the PyTorch deep learning library (Paszke et al., 2019). In particular, we follow the implementation of (Belilovsky et al., 2019), who proposed greedily, sequentially training two-layer CNNs. At each stage, a two-layer CNN (convolutional layer + average pooling + fully connected layer) is trained; then the weights are frozen, the fully connected layer and average pooling are discarded, and the trained convolutional layer is used as a feature generator for the following stage. At certain stages, before the CNN is applied, an invertible downsampling operation (Dinh et al., 2016) is used to reduce the spatial dimensions of the image.
In (Belilovsky et al., 2019), m = 128 convolutional filters per stage are used with ReLU activations, followed by an average pooling operation to spatial dimensions of 2 × 2 (i.e. a = 4), a flattening operation, and a fully connected layer. They also use a softmax cross-entropy loss, a batch size of 128, a weight-decay parameter of $\beta = 5 \times 10^{-4}$, and stochastic gradient descent (SGD) with momentum fixed to 0.9, 50 epochs per stage, and learning-rate decay by a factor of 0.2 every 15 epochs. In our experiments, we keep all network and optimization parameters the same, aside from replacing the non-convex CNN at each stage with our convex CNN objective (26). We then apply the Burer-Monteiro factorization with m = 1 to this architecture to make it tractable for layerwise learning, as described in the main paper. At each stage, we randomly subsample P = 256 hyperplane arrangements, rather than enumerating all P, which is a high-order polynomial in n. We further use gated ReLU rather than ReLU activations for simplicity, which can work as well as ReLU in practice (Fiat et al., 2019). These techniques have been used effectively for convex learning to exceed the performance of two-layer non-convex neural networks (Pilanci & Ergen, 2020; Ergen & Pilanci, 2020; Ergen et al., 2021). For the CIFAR-10 experiment, we use 5 stages (following (Belilovsky et al., 2019)), whereas for the Fashion-MNIST experiment, we use 3 stages, since the training accuracy saturates after 3 stages. We choose learning rates per stage from $\{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}\}$ based on training accuracy for CIFAR-10; the chosen learning rates were $[10^{-1}, 10^{-2}, 10^{-3}, 10^{-2}, 10^{-2}]$. For Fashion-MNIST, we empirically observed that the training loss was better optimized with slightly higher learning rates, so we used $[2 \times 10^{-1}, 5 \times 10^{-2}, 5 \times 10^{-3}]$. All code used to run our experiments is provided.
Ultimately, our CIFAR-10 network with 5 stages took 9163 seconds to train, and the Fashion-MNIST network with 3 stages took 4931 seconds to train. In (Kiliçarslan & Celik, 2021), it is shown that an end-to-end 6-layer CNN with ReLU activations takes 640 seconds to train on CIFAR-10 and 285 seconds to train on Fashion-MNIST. We note that the purpose of this experiment is not to advocate for the use of layerwise BM networks over end-to-end trained networks, but simply to demonstrate the utility of the BM factorization in enabling convex neural networks to scale to the performance of end-to-end deep networks via layerwise learning.
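The optimizer used above (momentum GD whose learning rate decays multiplicatively when the training loss plateaus) can be sketched in a few lines. The patience window and the toy quadratic below are our own illustrative choices, not the paper's exact training setup:

```python
import numpy as np

def gd_with_plateau_decay(grad, loss, w0, lr=1.0, momentum=0.9,
                          decay=0.9, patience=50, epochs=5000):
    """Momentum GD; multiply lr by `decay` when the loss stops improving.

    `patience` (steps without improvement before decaying) is an assumption
    for this sketch, not a value from the paper.
    """
    w, vel = w0.copy(), np.zeros_like(w0)
    best, stall = np.inf, 0
    for _ in range(epochs):
        vel = momentum * vel - lr * grad(w)
        w = w + vel
        cur = loss(w)
        if cur < best - 1e-8:
            best, stall = cur, 0
        else:
            stall += 1
            if stall >= patience:
                lr *= decay
                stall = 0
    return w

# Toy strongly convex quadratic to exercise the loop; minimizer is A^{-1} b.
A = np.diag([1.0, 0.1])
b = np.array([1.0, -2.0])
loss = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b
w = gd_with_plateau_decay(grad, loss, np.zeros(2))
```

In the actual experiments the same schedule is driven by PyTorch autograd gradients of the BM objective rather than a closed-form `grad`.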



We examine linear and gated ReLU activations for CNNs in the Appendix. We examine gated ReLU and ReLU activations for self-attention in the Appendix. We find that GD applied to the BM factorization finds "subtle" saddle points: not quite local minima, but close. Interestingly, there is only a minor relationship between the optimality gap and the rank of the BM factorization m. While our optimality-gap bound for m = 1 is larger than that for larger values of m when β is small, the actual optimality gap is nearly identical across m. This experiment further validates the need to consider stationary points of the BM factorization, rather than just local minima, in order to fully characterize the BM factorization for efficient solutions to convex problems.

5 CONCLUSION

We are the first to adapt the Burer-Monteiro (BM) factorization to two-layer convex neural networks with linear and ReLU activations, which offers new insights on their global optima. We provide a novel relative optimality bound on stationary points of the BM factorization, which provides a condition whose satisfaction guarantees a globally optimal solution.



Using Theorems 3.3, 3.4, and 3.5, and noting the equivalence of the BM factorization with the original non-convex program (9), we are the first to show conditions under which there are no spurious local minima for self-attention networks.

Corollary 3.8.1. The linear-activation self-attention network (29) has no spurious local minima as long as the number of heads satisfies $m \ge m^*$, where $m^* \le \min\{d^2, dc\}$. Furthermore, for any twice-differentiable objective $F$, if for any local minimum $(\hat{U}_j, \hat{V}_j)_{j=1}^m$ of (29) the stacked matrix $\hat{R} = \begin{bmatrix} \hat{U} \\ \hat{V} \end{bmatrix}$ is rank-deficient, then that local minimum is a global minimum.

Figure 1: Example of the three-class spiral dataset with different numbers of samples n: (a) n = 45, (b) n = 150.

Figure 2: Relative optimality gap of the non-convex BM factorization of a gated-ReLU two-layer MLP for three-class spiral data classification (d = 2, c = 3). For fixed values of n, we demonstrate how β and m affect the relative optimality gap, both in terms of the proposed bound and the actual gap, where the global minimum is determined by convex optimization.



Figure 3: BM enables layerwise training of convex gated ReLU CNNs, which are competitive with end-to-end ReLU networks of similar depth. For CIFAR-10, we achieve a test accuracy of 80.5% compared to 81.6% for end-to-end non-convex training, and for Fashion-MNIST we achieve a test accuracy of 91.5% compared to 91.2% for end-to-end non-convex training (Kiliçarslan & Celik, 2021; Bhatnagar et al., 2017).

Lingxiao Wang, Xiao Zhang, and Quanquan Gu. A unified computational and statistical framework for nonconvex low-rank matrix estimation. In Artificial Intelligence and Statistics, pp. 981-990. PMLR, 2017.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Tian Ye and Simon S. Du. Global convergence of gradient descent for asymmetric low-rank matrix factorization. Advances in Neural Information Processing Systems, 34, 2021.

Qinqing Zheng and John Lafferty. Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051, 2016.

Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B. Wakin. The global optimization geometry of low-rank matrix optimization. arXiv preprint arXiv:1703.01256, 2017.

Liu Ziyin, Botao Li, James B. Simon, and Masahito Ueda. SGD can converge to local maxima. In International Conference on Learning Representations, 2021.

A PROOFS

A.1 RESULT OF (BURER & MONTEIRO, 2005) AND ITS APPLICATIONS TO RECTANGULAR MATRICES

In this subsection, we outline the precise theoretical statement of (Burer & Monteiro, 2005) and describe exactly how it corresponds to our summary in Section 2.2, and thus to the later derivations in our work. We first state the following result, without proof, from (Burer & Monteiro, 2005).

Lemma A.1 (Lemma 2.1 of (Burer & Monteiro, 2005)). Suppose $R \in \mathbb{R}^{(d+c) \times r}$, $S \in \mathbb{R}^{(d+c) \times r}$ satisfy $RR^\top = SS^\top$. Then, $S = RQ$ for some orthogonal $Q \in \mathbb{R}^{r \times r}$.

Now, we proceed to prove analogs of Theorem 2.3 of (Burer & Monteiro, 2005) for general SDPs with a rank constraint.

Lemma A.2 (Analog of Lemma 2.2 of (Burer & Monteiro, 2005)). Consider the problem
\[
\min_{R \in \mathbb{R}^{(d+c) \times r}} F'(RR^\top). \tag{31}
\]
Then, $\hat{R}$ is a local minimum of (31) if and only if $\hat{R}Q$ is a local minimum of (31) for all orthogonal $Q \in \mathbb{R}^{r \times r}$.

Proof. Since $Q$ is orthogonal, $(\hat{R}Q)(\hat{R}Q)^\top = \hat{R}QQ^\top \hat{R}^\top = \hat{R}\hat{R}^\top$. Thus, $\hat{R}' := \hat{R}Q$ attains the same objective value, gradients, and higher-order derivatives as $\hat{R}$. Thus, $\hat{R}$ is a local minimum if and only if $\hat{R}'$ is a local minimum.

Theorem A.3 (Analog of Theorem 2.3 of (Burer & Monteiro, 2005)). Consider the following two problems:
\[
\min_{X \succeq 0,\ \mathrm{rank}(X) \le r} F'(X) \tag{32}
\]
\[
\min_{R \in \mathbb{R}^{(d+c) \times r}} F'(RR^\top). \tag{33}
\]
Then, for any continuous function $F'$, a feasible solution $\hat{X}$ is a local minimizer of (32) if and only if, for $\hat{X} = \hat{R}\hat{R}^\top$, $\hat{R}$ is a local minimizer of (33).

Proof. We follow the exact same lines as (Burer & Monteiro, 2005). By continuity of the map $R \mapsto RR^\top$, we know that if $\hat{X}$ is a local minimizer of (32), then $\hat{R}$ is a local minimizer of (33). Now, we must prove the other direction, namely that if $\hat{X} = \hat{R}\hat{R}^\top$ is not a local minimizer of (32), then $\hat{R}$ is not a local minimizer of (33). Suppose that $\hat{X}$ is not a local minimum. By continuity of $F'$, there must be a sequence of feasible solutions $X_k \to \hat{X}$ with $F'(X_k) < F'(\hat{X})$. Factoring $X_k = R_k R_k^\top$ with $R_k \to R$ for some $R$ satisfying $RR^\top = \hat{X}$, we see that $R$ is not a local minimum of (33). Using the fact that $\hat{X} = \hat{R}\hat{R}^\top = RR^\top$ together with Lemmas A.1 and A.2, we conclude that $\hat{R}$ is not a local minimum of (33).

With this established, we now describe how this theorem applies to the setting described in Section 2.2, i.e.
the rectangular-matrix, non-SDP case.

Lemma A.4. Consider the solution $X^*$ to
\[
p_1^* := \min_{X \succeq 0} F'(X), \tag{34}
\]
and define $m^* := \mathrm{rank}(X^*)$. Then, for any $m \ge m^*$, (34) is equivalent to
\[
p_2^* := \min_{X \succeq 0,\ \mathrm{rank}(X) \le m} F'(X), \tag{35}
\]
i.e. $p_1^* = p_2^*$, and the solutions to (35) and (34) are identical.

Proof. Clearly, adding a rank constraint to (34) can only increase the objective value, so $p_2^* \ge p_1^*$. However, since the optimal solution to (34) satisfies the rank constraint for any $m \ge m^*$, every optimal solution of (34) maps to a feasible point of (35) that obtains the same objective, so $p_2^* \le p_1^*$. Putting these together, we have $p_2^* = p_1^*$, and the solutions are identical.

Lemma A.5. Consider the solution $X^*$ to (34) with $m^* := \mathrm{rank}(X^*)$. Then, for any convex, continuous function $F'$, the global minimizer $X^* = \hat{R}\hat{R}^\top$ corresponds to a local minimizer $\hat{R}$ of
\[
\min_{R \in \mathbb{R}^{(d+c) \times m}} F'(RR^\top), \tag{36}
\]
as long as $m \ge m^*$.

Proof. We simply use the result from Lemma A.4 and apply the equivalence to Theorem A.3, noting that if $F'$ is convex, all local minimizers of (34) are global optimizers.

Lemma A.6. Consider the optimization problem
\[
p_3^* := \min_{Z \in \mathbb{R}^{d \times c}} F(Z). \tag{37}
\]
Define $F'(X) := F(X_2)$ for $X = \begin{bmatrix} X_1 & X_2 \\ X_3 & X_4 \end{bmatrix}$, where $X_1 \in \mathbb{R}^{d \times d}$, $X_2 \in \mathbb{R}^{d \times c}$, $X_3 \in \mathbb{R}^{c \times d}$, $X_4 \in \mathbb{R}^{c \times c}$. Then, (37) is equivalent to (34), meaning that $p_1^* = p_3^*$ and their solutions map to each other.

Proof. For any solution $X^*$ to (34), we can simply form a solution $Z^*$ to (37) by letting $Z^* = X_2^*$, so clearly $p_1^* \ge p_3^*$. For any solution $Z^*$ to (37), factor $Z^* = U^* V^{*\top}$, e.g. with the SVD. Then, let $R^* = \begin{bmatrix} U^* \\ V^* \end{bmatrix}$ and form a solution to (34) as $X^* = R^* R^{*\top}$. Clearly, $X^* \succeq 0$, so $X^*$ is feasible, and the objective value is given by $F'(X^*) = F(U^* V^{*\top}) = F(Z^*)$, so $p_3^* \ge p_1^*$. Putting these two together, we conclude $p_1^* = p_3^*$, and the solutions of the two problems map to each other.

Lemma A.7 (Used in Section 2.2). Consider the optimization problem (37), with optimal solution $Z^*$ and $m^* := \mathrm{rank}(Z^*)$. Then, for any convex, continuous function $F'$, the solution $Z^* = \hat{U}\hat{V}^\top$ corresponds to a local minimum $(\hat{U}, \hat{V})$ of
\[
\min_{U \in \mathbb{R}^{d \times m},\ V \in \mathbb{R}^{c \times m}} F(UV^\top), \tag{39}
\]
as long as $m \ge m^*$.

Proof. Define $F' : \mathbb{R}^{(d+c) \times (d+c)} \to \mathbb{R}$ such that $F'(X) = F(X_2)$ for $X = \begin{bmatrix} X_1 & X_2 \\ X_3 & X_4 \end{bmatrix}$, with $X_1 \in \mathbb{R}^{d \times d}$, $X_2 \in \mathbb{R}^{d \times c}$, $X_3 \in \mathbb{R}^{c \times d}$, $X_4 \in \mathbb{R}^{c \times c}$.
Then, let $R = \begin{bmatrix} U \\ V \end{bmatrix} \in \mathbb{R}^{(d+c) \times m}$. One can re-write $F(UV^\top)$ as $F'(RR^\top)$. Then, we see that (39) can be expressed as (36), and any local minimizer of (36) is a local minimizer of (39).
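The invariance driving Lemmas A.1 and A.2 — that $R$ and $RQ$ generate the same $X = RR^\top$, and hence the same objective value $F'(RR^\top)$ — is easy to confirm numerically:

```python
import numpy as np

rng = np.random.default_rng(5)
d, c, r = 4, 3, 2
R = rng.standard_normal((d + c, r))

# A random orthogonal Q via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
S = R @ Q

# R and S = RQ yield the identical PSD matrix X = RR^T, so any objective
# F'(RR^T) takes the same value at R and at RQ.
assert np.allclose(R @ R.T, S @ S.T)
```

This is why local minimality of the factored problem is a property of the orbit $\{RQ\}$ rather than of any single factor $R$.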

