LOCAL SEARCH ALGORITHMS FOR RANK-CONSTRAINED CONVEX OPTIMIZATION

Abstract

We propose greedy and local search algorithms for rank-constrained convex optimization, namely solving min_{rank(A)≤r*} R(A) given a convex function R : R^{m×n} → R and a parameter r*. These algorithms consist of repeating two steps: (a) adding a new rank-1 matrix to A and (b) enforcing the rank constraint on A. We refine and improve the theoretical analysis of Shalev-Shwartz et al. (2011), and show that if the rank-restricted condition number of R is κ, a solution A with rank O(r* · min{κ log((R(0) − R(A*))/ε), κ²}) and R(A) ≤ R(A*) + ε can be recovered, where A* is the optimal solution and ε > 0 the error parameter. This significantly generalizes associated results on sparse convex optimization, as well as on rank-constrained convex optimization for smooth functions. We then introduce new practical variants of these algorithms that have superior runtime and recover better solutions in practice. We demonstrate the versatility of these methods on a wide range of applications involving matrix completion and robust principal component analysis.

1. INTRODUCTION

Given a real-valued convex function R : R^{m×n} → R on real matrices and a parameter r* ∈ N, the rank-constrained convex optimization problem consists of finding a matrix A ∈ R^{m×n} that minimizes R(A) among all matrices of rank at most r*:

min_{rank(A) ≤ r*} R(A)    (1)

Even though R is convex, the rank constraint makes this problem non-convex. Furthermore, it is known that this problem is NP-hard and even hard to approximate (Natarajan (1995); Foster et al. (2015)). In this work, we propose efficient greedy and local search algorithms for this problem. Our contribution is twofold:

1. We provide theoretical analyses that bound the rank and objective value of the solutions returned by the two algorithms in terms of the rank-restricted condition number, which is the natural generalization of the condition number to low-rank subspaces. The results are significantly stronger than previously known bounds for this problem.

2. We experimentally demonstrate that, after careful performance adjustments, the proposed general-purpose greedy and local search algorithms have superior performance to other methods, even some that are tailored to a particular problem. These algorithms can thus be considered a general tool for rank-constrained convex optimization and a viable alternative to methods based on convex relaxations or alternating minimization.

The rank-restricted condition number. Similarly to the work in sparse convex optimization, a restricted condition number quantity has been introduced as a reasonable assumption on R. If we let ρ⁺_r be the maximum smoothness bound and ρ⁻_r be the minimum strong convexity bound of R along rank-r directions only (these are called rank-restricted smoothness and strong convexity, respectively), the rank-restricted condition number is defined as κ_r = ρ⁺_r / ρ⁻_r.
If this quantity is bounded, one can efficiently find a solution A with R(A) ≤ R(A*) + ε and rank r = O(r* · κ_{r+r*} R(0)/ε) using a greedy algorithm (Shalev-Shwartz et al. (2011)). However, this is not an ideal bound, since the rank scales linearly with R(0)/ε, which can be particularly high in practice. Inspired by the analogous literature on sparse convex optimization by Natarajan (1995); Shalev-Shwartz et al. (2010); Zhang (2011); Jain et al. (2014) and more recently Axiotis & Sviridenko (2020), one would hope to achieve a logarithmic dependence, or no dependence at all, on R(0)/ε. In this paper we achieve this goal by providing an improved analysis showing that the greedy algorithm of Shalev-Shwartz et al. (2011) in fact returns a matrix of rank r = O(r* · κ_{r+r*} log(R(0)/ε)). We also provide a new local search algorithm together with an analysis guaranteeing a rank of r = O(r* · κ²_{r+r*}). Apart from significantly improving upon previous work on rank-restricted convex optimization, these results directly generalize a long line of work in sparse convex optimization, e.g. Natarajan (1995); Shalev-Shwartz et al. (2010); Jain et al. (2014). Our algorithms and theorem statements can be found in Section 2.

Runtime improvements. Even though the rank bound guaranteed by our theoretical analyses is adequate, the algorithm runtimes leave much to be desired. In particular, both the greedy algorithm of Shalev-Shwartz et al. (2011) and our local search algorithm have to solve an optimization problem in each iteration in order to find the best possible linear combination of the features added so far. Even in the case R(A) = (1/2) Σ_{(i,j)∈Ω} (M − A)²_ij, this requires solving a least squares problem with |Ω| examples and r² variables. For practical implementations of these algorithms, we circumvent this issue by solving a related optimization problem that is usually much smaller: n least squares problems with |Ω| examples in total, each on r variables. This not only reduces the size of the problem by a factor of r, but also allows for a straightforward distributed implementation. Interestingly, our theoretical analyses still hold for these variants. We propose an additional heuristic that reduces the runtime even more drastically: running only a few (fewer than 10) iterations of the algorithm used to solve the inner optimization problem. Experimental results show that this modification not only does not significantly worsen results, but for machine learning applications even acts as a regularization method that can dramatically improve generalization. These matters, as well as additional improvements for making the local search algorithm more practical, are addressed in Section 2.3.

Roadmap

In Section 2, we provide the descriptions and theoretical results for the algorithms used, along with several modifications to boost performance. In Section 3, we evaluate the proposed greedy and local search algorithms on optimization problems like robust PCA. Then, in Section 4 we evaluate their generalization performance in machine learning problems like matrix completion.
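As a concrete illustration of the rank-restricted quantities defined in the introduction, the following sketch (not from the paper; all names are illustrative) empirically probes the rank-restricted smoothness and strong convexity of the quadratic matrix completion objective R(A) = (1/2)‖Π_Ω(M − A)‖²_F, whose curvature along a direction D is ‖Π_Ω(D)‖²_F / ‖D‖²_F. Sampling random rank-r directions only yields rough inner estimates of ρ⁺_r, ρ⁻_r, not certified bounds.

```python
import numpy as np

# Illustrative sketch: estimate rank-restricted smoothness / strong convexity
# of R(A) = 1/2 * ||mask * (M - A)||_F^2 by sampling random rank-r directions
# D = X Y^T and measuring the curvature ratio ||mask * D||_F^2 / ||D||_F^2.
def restricted_curvature(mask, r, samples=200, seed=0):
    rng = np.random.default_rng(seed)
    m, n = mask.shape
    ratios = []
    for _ in range(samples):
        D = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
        ratios.append((mask * D ** 2).sum() / (D ** 2).sum())
    rho_plus, rho_minus = max(ratios), min(ratios)
    return rho_plus, rho_minus, rho_plus / rho_minus   # ~rho+_r, ~rho-_r, ~kappa_r
```

With a fully observed mask every ratio is 1, so the estimated κ_r is exactly 1; dropping observed entries can only decrease curvature along some directions and hence increase the estimated condition number.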

2. ALGORITHMS & THEORETICAL GUARANTEES

In Sections 2.1 and 2.2 we state and provide theoretical performance guarantees for the basic greedy and local search algorithms, respectively. Then, in Section 2.3 we state the algorithmic adjustments that we propose in order to make the algorithms efficient in terms of runtime and generalization performance. A discussion regarding the tightness of the theoretical analysis is deferred to Appendix A.4. When the dimension is clear from context, we denote the all-ones vector by 1, and the vector that is 0 everywhere and 1 at position i by 1_i. Given a matrix A, we denote by im(A) its column span. One notion that we will find useful is that of singular value thresholding: given a rank-k matrix A ∈ R^{m×n} with SVD A = Σ_{i=1}^k σ_i u_i v_iᵀ, where σ_1 ≥ ... ≥ σ_k, as well as an integer parameter r ≥ 1, we define H_r(A) = Σ_{i=1}^r σ_i u_i v_iᵀ to be the operator that truncates A to its r largest singular values.
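The thresholding operator H_r just defined can be sketched directly with a dense SVD (a minimal illustration; a truncated or randomized SVD would be used at scale):

```python
import numpy as np

# H_r(A): keep only the r largest singular values of A.
def H(A, r):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]
```

Both algorithms below use H_1(∇R(A)) to extract the top singular pair of the gradient as the next rank-1 insertion.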

2.1. GREEDY

Algorithm 1 (Greedy) was first introduced in Shalev-Shwartz et al. (2011) as the GECO algorithm. It works by iteratively adding a rank-1 matrix to the current solution. This matrix is chosen as the rank-1 matrix that best approximates the gradient, i.e. the pair of singular vectors corresponding to the maximum singular value of the gradient. In each iteration, an additional procedure is run to optimize the combination of previously chosen singular vectors. In Shalev-Shwartz et al. (2011), the guarantee on the rank of the solution returned by the algorithm is r* κ_{r+r*} R(0)/ε. The main bottleneck in improving on the R(0)/ε factor is that the analysis is done in terms of the squared nuclear norm of the optimal solution. As the worst-case discrepancy between the squared nuclear norm and the rank is R(0)/ε, their bounds inherit this factor. Our analysis works directly with the rank, in the spirit of sparse optimization results (e.g. Shalev-Shwartz et al. (2011); Jain et al. (2014); Axiotis & Sviridenko (2020)). A challenge compared to these works is the need for a suitable notion of "intersection" between two sets of vectors. The main technical contribution of this work is to show that the orthogonal projection of one set of vectors onto the span of the other is such a notion, and, based on this, to define a decomposition of the optimal solution that is used in the analysis.

Algorithm 1 Greedy
1: procedure GREEDY(r ∈ N : target rank)
2:   ▷ function to be minimized: R : R^{m×n} → R
3:   U ← empty matrix in R^{m×0}   ▷ initially the rank is zero
4:   V ← empty matrix in R^{n×0}
5:   for t = 0 ... r − 1 do
6:     σ, u, v ← H_1(∇R(UVᵀ))   ▷ max singular value σ and corresponding singular vectors u, v
7:     U ← (U | u)   ▷ append new vectors as columns
8:     V ← (V | v)
9:     U, V ← OPTIMIZE(U, V)
10:  return UVᵀ
11: procedure OPTIMIZE(U ∈ R^{m×r}, V ∈ R^{n×r})
12:   X ← argmin_{X ∈ R^{r×r}} R(UXVᵀ)
13:   return UX, V

Theorem 2.1 (Algorithm 1 (greedy) analysis).
Let A* be any fixed optimal solution of (1) for some function R and rank bound r*, and let ε > 0 be an error parameter. For any integer r ≥ 2r* · κ_{r+r*} log((R(0) − R(A*))/ε), if we let A = GREEDY(r) be the solution returned by Algorithm 1, then R(A) ≤ R(A*) + ε. The number of iterations is r. The proof of Theorem 2.1 can be found in Appendix A.2.
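A minimal sketch of Algorithm 1, specialized to the matrix completion objective R(A) = (1/2)‖Π_Ω(M − A)‖²_F, may help fix ideas. Here the exact OPTIMIZE step, min_X R(UXVᵀ), is a least squares problem in the r² entries of X; the names (M, mask) and the dense solver are illustrative, not the paper's implementation.

```python
import numpy as np

# OPTIMIZE: solve min_X 1/2 * ||mask * (M - U X V^T)||_F^2 exactly, as a
# least squares problem with |Omega| examples and r^2 variables.
def optimize(U, V, M, mask):
    rows, cols = np.nonzero(mask)
    Phi = np.stack([np.kron(U[i], V[j]) for i, j in zip(rows, cols)])
    x, *_ = np.linalg.lstsq(Phi, M[rows, cols], rcond=None)
    return U @ x.reshape(U.shape[1], V.shape[1]), V

# Algorithm 1 (Greedy): insert the top singular pair of the gradient, then
# re-optimize the mixing matrix X.
def greedy(M, mask, r):
    m, n = M.shape
    U, V = np.zeros((m, 0)), np.zeros((n, 0))
    for _ in range(r):
        G = mask * (U @ V.T - M)            # gradient of R at A = U V^T
        lu, s, rv = np.linalg.svd(G)        # H_1: top singular pair of G
        U = np.column_stack([U, lu[:, 0]])
        V = np.column_stack([V, rv[0]])
        U, V = optimize(U, V, M, mask)
    return U @ V.T
```

With a fully observed mask this reduces to iteratively peeling off top singular directions, so a rank-r* matrix is recovered exactly in r* iterations.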

2.2. LOCAL SEARCH

One drawback of Algorithm 1 is that it increases the rank in each iteration. Algorithm 2 is a modification of Algorithm 1 in which the rank is truncated in each iteration. The advantage of Algorithm 2 over Algorithm 1 is that it is able to make progress without increasing the rank of A, while Algorithm 1 necessarily increases the rank in each iteration. More specifically, because of the greedy nature of Algorithm 1, some rank-1 components that have been added to A might become obsolete or yield reduced benefit after a number of iterations. Algorithm 2 is able to identify such candidates and remove them, thus allowing it to continue making progress.

Theorem 2.2 (Algorithm 2 (local search) analysis). Let A* be any fixed optimal solution of (1) for some function R and rank bound r*, and let ε > 0 be an error parameter. For any integer r ≥ r* · (1 + 8κ²_{r+r*}), if we let A = LOCAL SEARCH(r) be the solution returned by Algorithm 2, then R(A) ≤ R(A*) + ε. The number of iterations is O(r* κ_{r+r*} log((R(0) − R(A*))/ε)). The proof of Theorem 2.2 can be found in Appendix A.3.

Algorithm 2 Local Search
1: procedure LOCAL SEARCH(r ∈ N : target rank)
2:   ▷ function to be minimized: R : R^{m×n} → R
3:   U ← 0^{m×r}   ▷ initialize with the all-zero solution
4:   V ← 0^{n×r}
5:   for t = 0 ... L − 1 do   ▷ run for L iterations
6:     σ, u, v ← H_1(∇R(UVᵀ))   ▷ max singular value σ and corresponding singular vectors u, v
7:     U, V ← TRUNCATE(U, V)   ▷ reduce the rank of UVᵀ by one
8:     U ← (U | u)   ▷ append new vectors as columns
9:     V ← (V | v)
10:    U, V ← OPTIMIZE(U, V)
11:  return UVᵀ
12: procedure TRUNCATE(U ∈ R^{m×r}, V ∈ R^{n×r})
13:   U', Σ, V' ← SVD(H_{r−1}(UVᵀ))   ▷ keep all but the minimum singular value
14:   return U'Σ, V'
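A minimal sketch of Algorithm 2 for the completion objective R(A) = (1/2)‖Π_Ω(M − A)‖²_F follows. The stopping rule (return the previous iterate once R stops decreasing) is borrowed from the fast variant described in Section 2.3 rather than running a fixed L iterations; names (M, mask, n_iters) and the dense inner solver are illustrative.

```python
import numpy as np

# Exact OPTIMIZE step: least squares over the r x r mixing matrix X.
def _optimize(U, V, M, mask):
    rows, cols = np.nonzero(mask)
    Phi = np.stack([np.kron(U[i], V[j]) for i, j in zip(rows, cols)])
    x, *_ = np.linalg.lstsq(Phi, M[rows, cols], rcond=None)
    return U @ x.reshape(U.shape[1], V.shape[1]), V

# Algorithm 2 (Local Search): truncate the smallest singular value, insert
# the top singular pair of the gradient, re-optimize.
def local_search(M, mask, r, n_iters=30):
    m, n = M.shape
    U, V = np.zeros((m, r)), np.zeros((n, r))       # all-zero initial solution
    R = lambda A: 0.5 * ((mask * (M - A)) ** 2).sum()
    for _ in range(n_iters):
        A_prev = U @ V.T
        G = mask * (A_prev - M)                      # gradient at A_prev
        lu, s, rv = np.linalg.svd(G)
        P, sig, Qt = np.linalg.svd(A_prev, full_matrices=False)
        U, V = P[:, :r - 1] * sig[:r - 1], Qt[:r - 1].T   # TRUNCATE
        U = np.column_stack([U, lu[:, 0]])           # insert top pair of G
        V = np.column_stack([V, rv[0]])
        U, V = _optimize(U, V, M, mask)
        if R(U @ V.T) >= R(A_prev):                  # no more progress
            return A_prev
    return U @ V.T
```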

2.3. ALGORITHMIC ADJUSTMENTS

Inner optimization problem. The inner optimization problem used in both greedy and local search is

min_{X ∈ R^{r×r}} R(UXVᵀ).    (2)

It essentially finds the pair of matrices, with columns in the column spans of U and V respectively, whose product minimizes R. We, however, consider the following problem instead:

min_{V' ∈ R^{n×r}} R(UV'ᵀ).    (3)

Note that the solution recovered from (3) will never have worse objective value than the one recovered from (2), and that nothing in the analysis of the algorithms breaks. Importantly, (3) can usually be solved much more efficiently than (2). As an example, consider the following objective, which appears in matrix completion: R(A) = (1/2) Σ_{(i,j)∈Ω} (M − A)²_ij for some Ω ⊆ [m] × [n]. If we let Π_Ω(·) be the operator that zeroes out all positions of a matrix that are not in Ω, we have ∇R(A) = −Π_Ω(M − A). The optimality condition of (2) is Uᵀ Π_Ω(M − UXVᵀ) V = 0, and that of (3) is Uᵀ Π_Ω(M − UV'ᵀ) = 0. The former corresponds to a least squares linear regression problem with |Ω| examples and r² variables, while the latter decomposes into n independent systems

(Σ_{i : (i,j)∈Ω} Uᵀ 1_i 1_iᵀ U) V'_j = Uᵀ Π_Ω(M) 1_j,

where the variable V'_j is the j-th column of V'. The j-th of these systems corresponds to a least squares linear regression problem with |{i : (i, j) ∈ Ω}| examples and r variables. Note that the total number of examples over all systems is Σ_{j∈[n]} |{i : (i, j) ∈ Ω}| = |Ω|. The choice of V' as the variable to be optimized is arbitrary; in particular, as can be seen in Algorithm 3, in practice we alternate between optimizing U and V in each iteration. It is worth mentioning that the OPTIMIZE FAST procedure is essentially one step of the popular alternating minimization procedure for solving low-rank problems. Viewed from this lens, our proposed algorithms can be seen as alternating minimization interleaved with rank-1 insertion and/or removal steps.
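The column-wise decomposition of (3) can be sketched directly: with U fixed, each column V'_j is an independent least squares problem over only the rows i with (i, j) ∈ Ω (names are illustrative).

```python
import numpy as np

# Decomposed inner problem (3): one small least squares problem per column,
# with |Omega| examples in total across the n systems.
def optimize_V(U, M, mask):
    n, r = M.shape[1], U.shape[1]
    V = np.zeros((n, r))
    for j in range(n):
        rows = np.nonzero(mask[:, j])[0]     # {i : (i, j) in Omega}
        if rows.size:
            V[j], *_ = np.linalg.lstsq(U[rows], M[rows, j], rcond=None)
    return V
```

Since the n systems are independent, this step is also trivially parallelizable, which is the distributed implementation alluded to in the introduction.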

Singular value decomposition

As modern methods for computing the top components of a singular value decomposition scale very well even for large sparse matrices (Martinsson et al. (2011); Szlam et al. (2014); Tulloch (2014)), the "insertion" step of greedy and local search, in which the top component of the SVD of the gradient is computed, is quite fast in practice. However, these methods are not suited to computing the smallest singular values and corresponding singular vectors, a step required by the local search algorithm that we propose. Therefore, in our practical implementations we opt to perform an alternative step that directly removes one pair of vectors from the representation UVᵀ. A simple approach is to go over all r possible removals and pick the one that increases the objective the least; a variation of this approach has been used by Shalev-Shwartz et al. (2011). A much faster approach, however, is to simply pick the pair of vectors U 1_i, V 1_i that minimizes ‖U 1_i‖₂ ‖V 1_i‖₂. This is the approach that we use, as can be seen in Algorithm 4.

Algorithm 3 Fast Inner Optimization
1: procedure OPTIMIZE FAST(U ∈ R^{m×r}, V ∈ R^{n×r}, t ∈ N : iteration index of algorithm)
2:   if t mod 2 = 0 then
3:     X ← argmin_{X ∈ R^{m×r}} R(XVᵀ)
4:     return X, V
5:   else
6:     X ← argmin_{X ∈ R^{n×r}} R(UXᵀ)
7:     return U, X

Algorithm 4 Fast Rank Reduction
1: procedure TRUNCATE FAST(U ∈ R^{m×r}, V ∈ R^{n×r})
2:   i ← argmin_{i ∈ [r]} ‖U 1_i‖₂ ‖V 1_i‖₂
3:   return (U_{[m],[1,i−1]} | U_{[m],[i+1,r]}), (V_{[n],[1,i−1]} | V_{[n],[i+1,r]})   ▷ remove column i

After the previous discussion, we are ready to state the fast versions of Algorithm 1 and Algorithm 2 that we use in our experiments. These are Algorithm 2.3 and Algorithm 5. Notice that we initialize Algorithm 5 with the solution of Algorithm 2.3, and that we run it until the value of R(·) stops decreasing rather than for a fixed number of iterations. Algorithm 2.3 (Fast Greedy).
The Fast Greedy algorithm is defined identically to Algorithm 1, with the only difference that it uses the OPTIMIZE FAST routine in place of the OPTIMIZE routine. The main loop of Algorithm 5 (Fast Local Search; its initialization is listed separately below) is:

5:   repeat
       U_prev, V_prev ← U, V
6:     σ, u, v ← H_1(∇R(UVᵀ))   ▷ max singular value σ and corresponding singular vectors u, v
7:     U, V ← TRUNCATE FAST(U, V)   ▷ reduce the rank of UVᵀ by one
8:     U ← (U | u)   ▷ append new vectors as columns
9:     V ← (V | v)
10:    U, V ← OPTIMIZE FAST(U, V, t)
11:  while R(UVᵀ) < R(U_prev V_prevᵀ)
12:  return U_prev V_prevᵀ
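The TRUNCATE FAST step (Algorithm 4) is a one-liner in practice: drop the column pair minimizing ‖U 1_i‖₂ ‖V 1_i‖₂, avoiding any small-singular-value SVD computation. A minimal sketch:

```python
import numpy as np

# Algorithm 4 (TRUNCATE FAST): remove the column pair with the smallest
# product of column norms, a cheap proxy for the least important rank-1 term.
def truncate_fast(U, V):
    scores = np.linalg.norm(U, axis=0) * np.linalg.norm(V, axis=0)
    i = int(np.argmin(scores))
    return np.delete(U, i, axis=1), np.delete(V, i, axis=1)
```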

3. OPTIMIZATION APPLICATIONS

An immediate application of the above algorithms is the problem of low-rank matrix recovery. Given a convex distance measure between matrices d : R^{m×n} × R^{m×n} → R_{≥0}, the goal is to find a low-rank matrix A that matches a target matrix M as well as possible in terms of d:

min_{rank(A) ≤ r*} d(M, A)

This problem captures many different applications, some of which we go over in the following sections.

3.1. LOW-RANK APPROXIMATION ON OBSERVED SET

A particular case of interest is when d(M, A) is the Frobenius norm of M − A restricted to the entries belonging to some set Ω, i.e. d(M, A) = (1/2)‖Π_Ω(M − A)‖²_F. We compared our Fast Greedy and Fast Local Search algorithms with the SoftImpute algorithm of Mazumder et al. (2010), as implemented by Rubinsteyn & Feldman (2016), on the same experiments as in Mazumder et al. (2010). We solved the inner optimization problem required by our algorithms with the LSQR algorithm (Paige & Saunders (1982)). More specifically, M = UVᵀ + η ∈ R^{100×100}, where η is a noise matrix; the entries of U, V, η are i.i.d. normal, and we measure the normalized MSE ‖Π_Ω(M − A)‖²_F / ‖Π_Ω(M)‖²_F. The results can be seen in Figure 1, which illustrates that Fast Local Search sometimes returns significantly more accurate and lower-rank solutions than Fast Greedy, and Fast Greedy generally returns significantly more accurate and lower-rank solutions than SoftImpute.
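A sketch of this synthetic setup (sizes, the SNR convention for scaling the noise, and all names are illustrative assumptions, not the paper's exact generator):

```python
import numpy as np

# Synthetic instance M = U V^T + eta with i.i.d. normal entries and a
# uniformly random observed set Omega of density p.  The noise is scaled so
# that the signal-to-noise ratio is roughly `snr` (an assumed convention).
def make_instance(m, n, r_true, p, snr, seed=0):
    rng = np.random.default_rng(seed)
    L = rng.standard_normal((m, r_true)) @ rng.standard_normal((r_true, n))
    eta = rng.standard_normal((m, n)) * (np.sqrt((L ** 2).mean()) / snr)
    mask = (rng.random((m, n)) < p).astype(float)
    return L + eta, mask

# Normalized MSE on the observed entries, as used in Section 3.1.
def normalized_mse(M, A, mask):
    return ((mask * (M - A)) ** 2).sum() / ((mask * M) ** 2).sum()
```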

3.2. ROBUST PRINCIPAL COMPONENT ANALYSIS (RPCA)

The robust PCA paradigm asks one to decompose a given matrix M as L + S, where L is a low-rank matrix and S is a sparse matrix. This is useful in applications with outliers, where the outliers significantly affect a direct computation of the principal components of M. For a comprehensive survey on robust PCA one can look at Bouwmans et al. (2018). The following optimization problem encodes the above-stated requirements:

min_{rank(L) ≤ r*} ‖M − L‖₀    (4)

where ‖X‖₀ is the sparsity (i.e. number of non-zeros) of X. As neither the rank constraint nor the ℓ₀ function is convex, Candès et al. (2011) replaced them by their usual convex relaxations, i.e. the nuclear norm ‖·‖* and the ℓ₁ norm respectively. However, we opt to relax only the ℓ₀ function and not the rank constraint, leaving us with the problem

min_{rank(L) ≤ r*} ‖M − L‖₁    (5)

To make the objective differentiable and thus more well-behaved, we further replace the ℓ₁ norm by the Huber loss

H_δ(x) = x²/2 if |x| ≤ δ, and δ|x| − δ²/2 otherwise,

thus getting min_{rank(L) ≤ r*} Σ_ij H_δ((M − L)_ij). This is a problem to which we can directly apply our algorithms. We solve the inner optimization problem by applying 10 L-BFGS iterations. Figure 2 shows an example of foreground-background separation from video; we compare against Principal Component Pursuit (PCP) as implemented in Lin et al. (2010), where we tuned the regularization parameter λ to achieve the best result. We find that Fast Greedy has the best performance out of the three algorithms on this sample task.

Figure 2: Foreground-background separation from video. From left to right: Fast Greedy with rank = 3 and Huber loss with δ = 20; standard PCA with rank = 1; Principal Component Pursuit (PCP) with λ = 0.002. Both PCA and PCP have visible "shadows" in the foreground that appear as "smudges" in the background. These are less obvious in a still frame but more apparent in video.
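The Huber-smoothed objective and its gradient with respect to L are simple to write down; a minimal sketch (names are illustrative):

```python
import numpy as np

# Huber loss H_delta: quadratic near zero, linear in the tails.
def huber(x, delta):
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x ** 2, delta * a - 0.5 * delta ** 2)

# Smoothed robust PCA objective R(L) = sum_ij H_delta((M - L)_ij) and its
# gradient: since d/dx H_delta(x) = clip(x, -delta, delta), we get
# grad R(L) = -clip(M - L, -delta, delta).
def rpca_objective_and_grad(L, M, delta):
    R = huber(M - L, delta).sum()
    grad = -np.clip(M - L, -delta, delta)
    return R, grad
```

Feeding this gradient to the insertion step (top singular pair of ∇R) and the objective to an inner L-BFGS solver reproduces the setup described above.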

4. MACHINE LEARNING APPLICATIONS

4.1. REGULARIZATION TECHNIQUES

In the previous section we showed that our proposed algorithms drive down different optimization objectives aggressively. However, in applications where the goal is a low generalization error, regularization is needed. We considered two different kinds of regularization. The first is to run the inner optimization algorithm for fewer iterations, usually 2-3. This is usually straightforward, since an iterative method is used; for example, in the case R(A) = (1/2)‖Π_Ω(M − A)‖²_F, the inner optimization is a least squares linear regression problem that we solve using the LSQR algorithm. The second is to add an ℓ₂ regularizer to the objective function. However, this option did not provide a substantial performance boost in our experiments, and so we have not included it in our implementation.
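The regularizing effect of capping the inner solver's iteration count can be illustrated with a numpy-only sketch. Here plain gradient (Landweber) iterations stand in for LSQR, which is an assumption for illustration only; the qualitative effect is the same: starting from zero, each iteration grows the solution toward the unregularized least squares solution, so early stopping yields a shrunken (implicitly regularized) solution.

```python
import numpy as np

# Early-stopped least squares min_x ||A x - b||^2 via gradient (Landweber)
# iterations from x = 0.  Few iterations act as implicit regularization.
def early_stopped_lstsq(A, b, n_iters, step=None):
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # safe step: 1 / sigma_max^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x = x + step * A.T @ (b - A @ x)         # gradient step on the residual
    return x
```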

4.2. MATRIX COMPLETION WITH RANDOM NOISE

In this section we evaluate our algorithms on the task of recovering a low-rank matrix UVᵀ after observing Π_Ω(UVᵀ + η), i.e. a fraction of its entries with added noise. As in Section 3.1, we use the setting of Mazumder et al. (2010) and compare with the SoftImpute method. The evaluation metric is the normalized MSE, defined as (Σ_{(i,j)∉Ω} (UVᵀ − A)²_ij) / (Σ_{(i,j)∉Ω} (UVᵀ)²_ij), where A is the predicted matrix and UVᵀ the true low-rank matrix. A few example plots can be seen in Figure 3 and a table of results in Table 1. We implemented the Fast Greedy and Fast Local Search algorithms with 3 inner optimization iterations. In the first few iterations there is a spike in the relative MSE of the algorithms that use the OPTIMIZE FAST routine. We attribute this to the aggressive alternating minimization steps of this procedure, and conjecture that adding a regularization term to the objective might smooth out the spike. Nevertheless, the Fast Local Search algorithm still gives the best overall performance in terms of how well it approximates the true low-rank matrix UVᵀ, and in particular does so with a very small rank, practically the same as the true underlying rank.

4.3. RECOMMENDER SYSTEMS

In this section we compare our algorithms on the task of movie recommendation on the Movielens datasets (Harper & Konstan (2015)). To evaluate the algorithms, we perform random 80%-20% train-test splits that are the same for all algorithms, and measure the mean RMSE on the test set. If we let Ω ⊆ [m] × [n] be the set of user-movie pairs in the training set, we assume that the true user-movie matrix is low rank, and thus pose (1) with R(A) = (1/2)‖Π_Ω(M − A)‖²_F. We make the following slight modification to take into account the range [1, 5] of the ratings: we clip the entries of A between 1 and 5 when computing ∇R(A) in Algorithm 2.3 and Algorithm 5. In other words, instead of Π_Ω(A − M) we compute the gradient as Π_Ω(clip(A, 1, 5) − M). This is similar to replacing our objective by a Huber loss, with the difference that we only do so in the steps mentioned and not in the inner optimization step, mainly for runtime efficiency reasons. The results can be seen in Table 2. We do not compare with Fast Local Search, as we found that it only provides an advantage for small ranks (< 30) and otherwise matches Fast Greedy. For the inner optimization steps we used the LSQR algorithm with 2 iterations on the 100K and 1M datasets, and with 3 iterations on the 10M dataset. Note that even though the SVD algorithm of Koren et al. (2009), as implemented by Hug (2020) (with no user/movie bias terms), is a highly tuned algorithm for recommender systems that was among the top solutions in the famous Netflix prize, it has comparable performance to our general-purpose Algorithm 2.3. Finally, Table 3 demonstrates the speedup achieved by our algorithms over the basic greedy implementation of Shalev-Shwartz et al. (2011). We also compare with an Alternating Minimization baseline (2004) that alternately minimizes the left and right subspace and also uses Frobenius norm regularization.
For SoftImpute and Alternating Minimization we found the best choice of parameters by performing a grid search over the rank and the multiplier of the regularization term. We ran 20 iterations of Alternating Minimization in each case.
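The clipped-gradient modification described above is a one-line change to the gradient computation; a minimal sketch (names are illustrative):

```python
import numpy as np

# Rating-range modification of Section 4.3: clip the predictions to [1, 5]
# before forming the gradient, i.e. use P_Omega(clip(A, 1, 5) - M) in place
# of P_Omega(A - M).
def clipped_gradient(A, M, mask, lo=1.0, hi=5.0):
    return mask * (np.clip(A, lo, hi) - M)
```

Entries already inside the rating range are unaffected, so the modification only dampens the error signal from out-of-range predictions, much like a Huber loss would.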


It is important to note that our goal here is not to be competitive with the best known algorithms for matrix completion, but rather to propose a general yet practically applicable method for rank-constrained convex optimization. For a recent survey of the best performing algorithms on the Movielens datasets, see Rendle et al. (2019). It should be noted that many of these algorithms have a significant performance boost compared to our methods because they use additional features (meta-information about each user, movie, timestamp of a rating, etc.) or stronger models (user/movie biases, "implicit" ratings). A runtime comparison with these recent approaches is an interesting avenue for future work. As a rule of thumb, however, Fast Greedy has roughly the same per-iteration runtime as SVD (Koren et al. (2009)), i.e. O(|Ω|r), where Ω is the set of observed entries and r is the rank. As some better performing approaches have been reported to be much slower than SVD (e.g. SVD++ is reported to be 50-100x slower than SVD on the Movielens 100K and 1M datasets (Hug (2020))), this might also suggest a runtime advantage of our approach over some better performing methods.

5. CONCLUSIONS

We presented simple algorithms with strong theoretical guarantees for optimizing a convex function under a rank constraint. Although the basic versions of these algorithms have appeared before, through a series of careful runtime, optimization, and generalization improvements we have reshaped the performance of these algorithms on all fronts. Via experimental validation on a host of practical problems, such as low-rank matrix recovery with missing data, robust principal component analysis, and recommender systems, we have shown that the solution quality matches or exceeds that of other widely used, and even specialized, solutions. This makes the argument that our Fast Greedy and Fast Local Search routines can be regarded as strong and practical general-purpose tools for rank-constrained convex optimization. Interesting directions for further research include the exploration of different kinds of regularization and tuning for machine learning applications, as well as a competitive implementation and extensive runtime comparison of our algorithms.

A.2 PROOF OF THEOREM 2.1 (GREEDY)

Additionally, we have the following lemma regarding the optimality conditions of (2):

Lemma A.8. Let A = UXVᵀ, where U ∈ R^{m×r}, X ∈ R^{r×r}, and V ∈ R^{n×r}, such that X is the optimal solution to (2). Then for any u ∈ im(U) and v ∈ im(V) we have ⟨∇R(A), uvᵀ⟩ = 0.

Proof. By the optimality condition of (2), we have Uᵀ ∇R(A) V = 0. Now, for any u = Ux and v = Vy we have ⟨∇R(A), uvᵀ⟩ = uᵀ ∇R(A) v = xᵀ Uᵀ ∇R(A) V y = 0.

We are now ready for the proof of Theorem 2.1.

Proof. Let A_{t−1} be the current solution UVᵀ before iteration t − 1 ≥ 0. Let u ∈ R^m and v ∈ R^n be left and right singular vectors of the matrix ∇R(A_{t−1}), i.e. unit vectors maximizing |⟨∇R(A_{t−1}), uvᵀ⟩|. Let B_t = {B | B = A_{t−1} + η uvᵀ, η ∈ R}.
By smoothness we have

R(A_{t−1}) − R(A_t) ≥ max_{B ∈ B_t} {R(A_{t−1}) − R(B)}
  ≥ max_{B ∈ B_t} { −⟨∇R(A_{t−1}), B − A_{t−1}⟩ − (ρ⁺_1/2) ‖B − A_{t−1}‖²_F }
  ≥ max_{η ∈ R} { −η ⟨∇R(A_{t−1}), uvᵀ⟩ − η² ρ⁺_1/2 }
  = max_{η ∈ R} { η ‖∇R(A_{t−1})‖₂ − η² ρ⁺_1/2 }
  = ‖∇R(A_{t−1})‖₂² / (2ρ⁺_1),

where ‖·‖₂ is the spectral norm (i.e. the maximum magnitude of a singular value). On the other hand, by strong convexity and noting that rank(A* − A_{t−1}) ≤ rank(A*) + rank(A_{t−1}) ≤ r* + r,

R(A*) − R(A_{t−1}) ≥ ⟨∇R(A_{t−1}), A* − A_{t−1}⟩ + (ρ⁻_{r+r*}/2) ‖A* − A_{t−1}‖²_F.    (★)

Let A_{t−1} = UVᵀ and A* = U*(V*)ᵀ. We let Π_{im(U)} = U(UᵀU)⁺Uᵀ and Π_{im(V)} = V(VᵀV)⁺Vᵀ denote the orthogonal projections onto the images of U and V respectively. We now write

A* = U*(V*)ᵀ = (U₁ + U₂)(V₁ + V₂)ᵀ = U₁V₁ᵀ + U₁V₂ᵀ + U₂(V*)ᵀ,

where U₁ = Π_{im(U)} U* is the matrix in which every column of U* is replaced by its projection onto im(U), U₂ = U* − U₁, and similarly V₁ = Π_{im(V)} V* is the matrix in which every column of V* is replaced by its projection onto im(V), V₂ = V* − V₁. By setting U' = (−U | U₁) and V' = (V | V₁) we can write

A* − A_{t−1} = U'(V')ᵀ + U₁V₂ᵀ + U₂(V*)ᵀ,

where im(U') = im(U) and im(V') = im(V). Also note that rank(U₁V₂ᵀ) ≤ rank(V₂) ≤ rank(V*) = rank(A*) ≤ r*, and similarly rank(U₂(V*)ᵀ) ≤ r*. The right hand side of (★) can now be rewritten as

⟨∇R(A_{t−1}), A* − A_{t−1}⟩ + (ρ⁻_{r+r*}/2) ‖A* − A_{t−1}‖²_F
  = ⟨∇R(A_{t−1}), U'(V')ᵀ + U₁V₂ᵀ + U₂(V*)ᵀ⟩ + (ρ⁻_{r+r*}/2) ‖U'(V')ᵀ + U₁V₂ᵀ + U₂(V*)ᵀ‖²_F.

Now, since by definition the columns of U' are in im(U) and the columns of V' are in im(V), Lemma A.8 implies ⟨∇R(A_{t−1}), U'(V')ᵀ⟩ = 0. Therefore the above is equal to

⟨∇R(A_{t−1}), U₁V₂ᵀ + U₂(V*)ᵀ⟩ + (ρ⁻_{r+r*}/2) ‖U'(V')ᵀ + U₁V₂ᵀ + U₂(V*)ᵀ‖²_F
  ≥ ⟨∇R(A_{t−1}), U₁V₂ᵀ⟩ + ⟨∇R(A_{t−1}), U₂(V*)ᵀ⟩ + (ρ⁻_{r+r*}/2) (‖U₁V₂ᵀ‖²_F + ‖U₂(V*)ᵀ‖²_F)
  ≥ 2 min_{rank(M) ≤ r*} { ⟨∇R(A_{t−1}), M⟩ + (ρ⁻_{r+r*}/2) ‖M‖²_F }
  = −2 ‖H_{r*}(∇R(A_{t−1}))‖²_F / (2ρ⁻_{r+r*})
  ≥ −r* ‖∇R(A_{t−1})‖₂² / ρ⁻_{r+r*},

where the first inequality follows by noticing that the columns of V and V₁ are orthogonal to those of V₂, and the columns of U and U₁ are orthogonal to those of U₂, together with Lemma A.7; the equality is a direct application of Lemma A.6; and the last inequality states that the largest squared singular value is not smaller than the average of the top r* squared singular values. Therefore we have concluded that

‖∇R(A_{t−1})‖₂² ≥ (ρ⁻_{r+r*}/r*) (R(A_{t−1}) − R(A*)).

Plugging this back into the smoothness inequality, we get

R(A_{t−1}) − R(A_t) ≥ (1/(2r*κ)) (R(A_{t−1}) − R(A*)),

where κ = ρ⁺_1/ρ⁻_{r+r*} ≤ κ_{r+r*}, or equivalently

R(A_t) − R(A*) ≤ (1 − 1/(2r*κ)) (R(A_{t−1}) − R(A*)).

Therefore after L = 2r*κ log((R(A₀) − R(A*))/ε) iterations we have

R(A_L) − R(A*) ≤ (1 − 1/(2r*κ))^L (R(A₀) − R(A*)) ≤ e^{−L/(2r*κ)} (R(A₀) − R(A*)) ≤ ε.

Since A₀ = 0, the result follows.

A.3 PROOF OF THEOREM 2.2 (LOCAL SEARCH)

Proof. Similarly to Appendix A.2, we let A_{t−1} be the current solution before iteration t − 1 ≥ 0. Let u ∈ R^m and v ∈ R^n be left and right singular vectors of the matrix ∇R(A_{t−1}), i.e. unit vectors maximizing |⟨∇R(A_{t−1}), uvᵀ⟩|, and let B_t = {B | B = A_{t−1} + η uvᵀ − σ_min xyᵀ, η ∈ R}, where σ_min xyᵀ = A_{t−1} − H_{r−1}(A_{t−1}) is the rank-1 term corresponding to the minimum singular value of A_{t−1}. By smoothness we have

R(A_{t−1}) − R(A_t) ≥ max_{B ∈ B_t} {R(A_{t−1}) − R(B)}
  ≥ max_{B ∈ B_t} { −⟨∇R(A_{t−1}), B − A_{t−1}⟩ − (ρ⁺_2/2) ‖B − A_{t−1}‖²_F }
  = max_{η ∈ R} { −⟨∇R(A_{t−1}), η uvᵀ − σ_min xyᵀ⟩ − (ρ⁺_2/2) ‖η uvᵀ − σ_min xyᵀ‖²_F }
  ≥ max_{η ∈ R} { −η ⟨∇R(A_{t−1}), uvᵀ⟩ − η² ρ⁺_2 − σ²_min ρ⁺_2 }
  = max_{η ∈ R} { η ‖∇R(A_{t−1})‖₂ − η² ρ⁺_2 } − σ²_min ρ⁺_2
  = ‖∇R(A_{t−1})‖₂² / (4ρ⁺_2) − σ²_min ρ⁺_2,

where in the last inequality we used the fact that ⟨∇R(A_{t−1}), xyᵀ⟩ = 0, which follows from Lemma A.8, as well as Lemma A.1. On the other hand, by strong convexity,

R(A*) − R(A_{t−1}) ≥ ⟨∇R(A_{t−1}), A* − A_{t−1}⟩ + (ρ⁻_{r+r*}/2) ‖A* − A_{t−1}‖²_F.

Let A_{t−1} = UVᵀ and A* = U*(V*)ᵀ. We write A* = U*(V*)ᵀ = (U₁ + U₂)(V₁ + V₂)ᵀ = U₁V₁ᵀ + U₁V₂ᵀ + U₂(V*)ᵀ, where U₁ is the matrix in which every column of U* is replaced by its projection onto im(U), U₂ = U* − U₁, and similarly V₁ is the matrix in which every column of V* is replaced by its projection onto im(V), V₂ = V* − V₁. By setting U' = (−U | U₁) and V' = (V | V₁) we can write A* − A_{t−1} = U'(V')ᵀ + U₁V₂ᵀ + U₂(V*)ᵀ, where im(U') = im(U) and im(V') = im(V). Also note that rank(U₁V₂ᵀ) ≤ rank(V₂) ≤ rank(V*) = rank(A*) ≤ r*, and similarly rank(U₂(V*)ᵀ) ≤ r*.

So we now have

⟨∇R(A_{t−1}), A* − A_{t−1}⟩ + (ρ⁻_{r+r*}/2) ‖A* − A_{t−1}‖²_F
  = ⟨∇R(A_{t−1}), U'(V')ᵀ + U₁V₂ᵀ + U₂(V*)ᵀ⟩ + (ρ⁻_{r+r*}/2) ‖U'(V')ᵀ + U₁V₂ᵀ + U₂(V*)ᵀ‖²_F
  = ⟨∇R(A_{t−1}), U₁V₂ᵀ + U₂(V*)ᵀ⟩ + (ρ⁻_{r+r*}/2) ‖U'(V')ᵀ + U₁V₂ᵀ + U₂(V*)ᵀ‖²_F
  = ⟨∇R(A_{t−1}), U₁V₂ᵀ + U₂(V*)ᵀ⟩ + (ρ⁻_{r+r*}/2) (‖U'(V')ᵀ‖²_F + ‖U₁V₂ᵀ‖²_F + ‖U₂(V*)ᵀ‖²_F)
  ≥ ⟨∇R(A_{t−1}), U₁V₂ᵀ⟩ + ⟨∇R(A_{t−1}), U₂(V*)ᵀ⟩ + (ρ⁻_{r+r*}/2) (‖U₁V₂ᵀ‖²_F + ‖U₂(V*)ᵀ‖²_F) + (ρ⁻_{r+r*}/2) ‖U'(V')ᵀ‖²_F
  ≥ 2 min_{rank(M) ≤ r*} { ⟨∇R(A_{t−1}), M⟩ + (ρ⁻_{r+r*}/2) ‖M‖²_F } + (ρ⁻_{r+r*}/2) ‖U'(V')ᵀ‖²_F
  = −2 ‖H_{r*}(∇R(A_{t−1}))‖²_F / (2ρ⁻_{r+r*}) + (ρ⁻_{r+r*}/2) ‖U'(V')ᵀ‖²_F
  ≥ −r* ‖∇R(A_{t−1})‖₂² / ρ⁻_{r+r*} + (ρ⁻_{r+r*}/2) ‖U'(V')ᵀ‖²_F,

where the second equality follows from the fact that ⟨∇R(A_{t−1}), uvᵀ⟩ = 0 for any u ∈ im(U), v ∈ im(V) (Lemma A.8), the third equality from the fact that im(U₂) ⊥ im(U) ∪ im(U₁) and im(V₂) ⊥ im(V) together with Lemma A.7, and the last inequality from the fact that the largest squared singular value is not smaller than the average of the top r* squared singular values. Now, note that since rank(U₁V₁ᵀ) ≤ r* < r = rank(UVᵀ),

‖U'(V')ᵀ‖²_F = ‖U₁V₁ᵀ − UVᵀ‖²_F = Σ_{i=1}^r σ²_i(U₁V₁ᵀ − UVᵀ) ≥ Σ_{i=1}^r (σ_{i+r*}(UVᵀ) − σ_{r*+1}(U₁V₁ᵀ))² = Σ_{i=r*+1}^r σ²_i(UVᵀ) ≥ (r − r*) σ²_min(UVᵀ) = (r − r*) σ²_min(A_{t−1}),

where we used the fact that rank(U₁V₁ᵀ) ≤ r* together with Lemma A.5. Therefore we have concluded that

‖∇R(A_{t−1})‖₂² ≥ (ρ⁻_{r+r*}/r*) (R(A_{t−1}) − R(A*)) + ((ρ⁻_{r+r*})²/(2r*)) (r − r*) σ²_min(A_{t−1}).

Plugging this back into the smoothness inequality, as long as r ≥ r*(1 + 8κ²), where κ = ρ⁺_2/ρ⁻_{r+r*}, we get

R(A_{t−1}) − R(A_t) ≥ (1/(4r*κ)) (R(A_{t−1}) − R(A*)),

or equivalently,

R(A_t) − R(A*) ≤ (1 − 1/(4r*κ)) (R(A_{t−1}) − R(A*)).

Therefore after L = 4r*κ log((R(A₀) − R(A*))/ε) iterations we have

R(A_L) − R(A*) ≤ (1 − 1/(4r*κ))^L (R(A₀) − R(A*)) ≤ e^{−L/(4r*κ)} (R(A₀) − R(A*)) ≤ ε.

Since A₀ = 0 and κ ≤ κ_{r+r*}, the result follows.

A.4 TIGHTNESS OF THE ANALYSIS

It is important to note that the $\kappa_{r+r^*}$ factor that appears in the rank bounds of both Theorems 2.1 and 2.2 is inherent in these algorithms and not an artifact of our analysis. In particular, such lower bounds based on the restricted condition number have previously been shown for the problem of sparse linear regression. More specifically, Foster et al. (2015) showed that there is a family of instances in which the analogues of Greedy and Local Search for sparse optimization require the sparsity to be $\Omega(s^* \kappa)$ for constant error $\epsilon > 0$, where $s^*$ is the optimal sparsity and $\kappa$ is the sparsity-restricted condition number. These instances can easily be adjusted to give a rank lower bound of $\Omega(r^* \kappa_{r+r^*})$ for constant error $\epsilon > 0$, implying that the $\kappa$ dependence in Theorem 2.1 is tight for Greedy. Furthermore, specifically for Local Search, Axiotis & Sviridenko (2020) additionally showed a lower bound that is quadratic in the condition number for the sparse analogue of Local Search.



Fast Local Search
1: procedure FAST LOCAL SEARCH(r ∈ N : target rank)
2:     input: function to be minimized R : R^{m×n} → R
3:     U, V ← solution returned by FAST GREEDY(r)

i.i.d. normal with mean 0, and the entries of Ω are chosen i.i.d. uniformly at random over the set [100] × [100]. The experiments have three parameters: the true rank r* (of UV^⊤), the fraction of observed entries p = |Ω|/10⁴, and the signal-to-noise ratio SNR. We measure the normalized MSE, i.e. $\|\Pi_\Omega(M - A)\|_F^2$
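Under one natural reading of this setup (Gaussian factors, a Bernoulli-$p$ observation mask, SNR taken as the ratio of signal power to noise power, and the normalized MSE normalized by the energy of the observed entries of $M$; the exact normalizer is truncated above), the data generation can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r_star, p, snr = 100, 5, 0.2, 10.0

# Ground-truth low-rank signal M0 = U V with i.i.d. normal factors.
U = rng.normal(size=(n, r_star))
V = rng.normal(size=(r_star, n))
M0 = U @ V

# Additive Gaussian noise rescaled so that ||M0||_F^2 / ||noise||_F^2 = SNR.
noise = rng.normal(size=(n, n))
noise *= np.linalg.norm(M0) / (np.sqrt(snr) * np.linalg.norm(noise))
M = M0 + noise

# Observation mask Omega: each entry revealed independently with probability p.
mask = rng.random((n, n)) < p

def nmse(A):
    # normalized MSE on the observed entries of M (our choice of normalizer)
    return np.sum((mask * (M - A)) ** 2) / np.sum((mask * M) ** 2)
```

With this normalizer, the all-zeros prediction has normalized MSE exactly 1, which makes values of different parameter settings directly comparable.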

Figure 1: Objective value error vs rank in the problem of Section 3.1.

one can see an example of foreground-background separation from video using robust PCA. The video is from the BMC 2012 dataset (Vacavant et al. (2012)). In this problem, the low-rank part corresponds to the background and the sparse part to the foreground. We compare three algorithms: our Fast Greedy algorithm, standard PCA with 1 component (the choice of 1 component gave the best outcome), and the standard Principal Component Pursuit (PCP) algorithm (Candès et al. (2011)), as implemented in Lin et al.
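For reference, a k-component PCA background model of the kind used as a baseline above can be sketched as follows; the helper name, the mean-centering choice, and the synthetic frames are ours, not the paper's pipeline:

```python
import numpy as np

def pca_background(frames, k=1):
    """Model the background as the mean frame plus the rank-k PCA
    reconstruction of the mean-centered frame matrix (one flattened
    frame per column); the residual is treated as foreground."""
    mean = frames.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(frames - mean, full_matrices=False)
    background = mean + (U[:, :k] * s[:k]) @ Vt[:k]
    return background, frames - background

# Tiny synthetic example: a static background plus a small moving blob.
rng = np.random.default_rng(0)
static = rng.random(400)                             # 20x20 image, flattened
frames = np.tile(static[:, None], (1, 50))
for t in range(50):
    frames[8 * t % 400 : 8 * t % 400 + 5, t] += 3.0  # sparse "foreground"
bg, fg = pca_background(frames, k=1)
```

Because a rank-1 model cannot represent the per-frame sparse blob, the blob energy ends up in the residual `fg`, which is exactly the behavior the foreground/background split relies on.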

Figure 3: Test error vs rank in the matrix completion problem of Section 4.2. Bands of ±1 standard error are shown. Note that SoftImpute starts to overfit for ranks larger than 12 in (a). The "jumps" at around rank 5-7 happen because of overshooting (taking too large a step) during the insertion of the rank-1 component in both Fast Greedy and Fast Local Search. More specifically, these implementations apply only 3 iterations of the inner optimization step, which in some cases are too few to correct the overshooting. However, after a few more iterations of the algorithm the overshooting issue is resolved (i.e. the algorithm has had enough iterations to scale down the rank-1 component that caused the overshooting).


Figure 4: One of the splits of the Movielens 100K dataset. We can see that for small ranks the Fast Local Search solution is better and more stable, but for larger ranks it does not provide any improvement over the Fast Greedy algorithm.

Figure 5: Test error vs rank in the matrix completion problem of Section 4.2. Bands of ±1 standard error are shown.

Figure 6: Performance of greedy when fully solving the inner optimization problem (left) and when applying 3 iterations of the LSQR algorithm (right) in the matrix completion problem of Section 4.2. Here k = 5, p = 0.2, SNR = 10. Bands of ±1 standard error are shown. This experiment shows why it is crucial to apply some kind of regularization to the Fast Greedy and Fast Local Search algorithms in machine learning applications.

Lowest test error over all ranks in the matrix completion problem of Section 4.2, together with the rank at which each algorithm attains it, in the form error/rank.

(Algorithm 1) is larger as the rank increases, since the fast algorithms scale linearly with the rank, but the basic greedy algorithm scales quadratically. Mean RMSE and standard error among 5 random splits for 100K and 1M (standard errors < 0.01), and 3 random splits for 10M (standard errors < 0.001). The rank of the prediction is set to 100, except for NMF where it is 15 and for Fast Greedy on the 10M dataset where it is chosen to be 35 by cross-validation. Alternating Minimization is a well-known algorithm (e.g. Srebro et al. (2004)).
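For context, the Alternating Minimization baseline referenced above can be sketched as follows for observed-entry matrix completion; the ridge term `lam`, the initialization scale, and all problem sizes are our choices for a self-contained example, not the paper's configuration:

```python
import numpy as np

def altmin(M, mask, r, iters=30, lam=1e-6):
    """Alternating minimization: with V fixed, each row of U solves a
    small least-squares problem over that row's observed entries, and
    symmetrically for V. lam is a tiny ridge term for stability."""
    m, n = M.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(m, r))
    V = rng.normal(scale=0.1, size=(n, r))
    for _ in range(iters):
        for i in range(m):
            Vi = V[mask[i]]
            U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(r), Vi.T @ M[i, mask[i]])
        for j in range(n):
            Uj = U[mask[:, j]]
            V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(r), Uj.T @ M[mask[:, j], j])
    return U, V

# Usage on a synthetic low-rank matrix with half the entries observed.
rng = np.random.default_rng(1)
M = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 30))
mask = rng.random(M.shape) < 0.5
U, V = altmin(M, mask, r=2)
err = np.linalg.norm(mask * (U @ V.T - M)) / np.linalg.norm(mask * M)
```

Unlike the greedy methods, this baseline fixes the rank up front and optimizes both factors jointly, which is why its per-iteration cost does not grow with an expanding rank.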

Runtimes (in seconds) of different algorithms for fitting a rank-30 solution in various experiments. Code written in Python and run on an Intel Skylake CPU with 16 vCPUs.


Shai Shalev-Shwartz, Alon Gonen, and Ohad Shamir. Large-scale convex minimization with a low-rank constraint. In Proceedings of the 28th International Conference on Machine Learning, pp. 329-336, 2011.

Nathan Srebro, Jason Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems, 17:1329-1336, 2004.

Arthur Szlam, Yuval Kluger, and Mark Tygert. An implementation of a randomized algorithm for principal component analysis. arXiv preprint arXiv:1412.

A APPENDIX

A.1 PRELIMINARIES AND NOTATION

Given a positive integer k, we denote [k] = {1, 2, . . . , k}. Given a matrix A, we denote by $\|A\|_F$ its Frobenius norm, i.e. the $\ell_2$ norm of the entries of A (or equivalently of the singular values of A). The following lemma is a simple corollary of the definition of the Frobenius norm:

Lemma A.1. For any two matrices $A, B \in \mathbb{R}^{m \times n}$, $\|A + B\|_F^2 \le 2\|A\|_F^2 + 2\|B\|_F^2$.

Definition A.2 (Rank-restricted smoothness, strong convexity, condition number). Given a convex function $R : \mathbb{R}^{m \times n} \to \mathbb{R}$ and an integer parameter r, the rank-restricted smoothness of R at rank r is the minimum constant $\rho_r^+ \ge 0$ such that for any two matrices $A, B \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(A - B) \le r$, we have
$$
R(B) \le R(A) + \langle \nabla R(A), B - A \rangle + \frac{\rho_r^+}{2} \| B - A \|_F^2 .
$$
Similarly, the rank-restricted strong convexity of R at rank r is the maximum constant $\rho_r^- \ge 0$ such that for any two matrices $A, B \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(A - B) \le r$, we have
$$
R(B) \ge R(A) + \langle \nabla R(A), B - A \rangle + \frac{\rho_r^-}{2} \| B - A \|_F^2 .
$$
Given that $\rho_r^+, \rho_r^-$ exist and are nonzero, the rank-restricted condition number of R at rank r is then defined as $\kappa_r = \rho_r^+ / \rho_r^-$. Note that $\rho_r^+$ is increasing and $\rho_r^-$ is decreasing in r; therefore, even though our bounds are proven in terms of constants at different ranks (such as $\rho_2^+$ and $\rho_{r+r^*}^-$), they also hold in terms of the single condition number $\kappa_{r+r^*}$.

Definition A.3 (Spectral norm). Given a matrix $A \in \mathbb{R}^{m \times n}$, we denote its spectral norm by $\|A\|_2$, defined as $\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$, i.e. the largest singular value of A.

Definition A.4 (Singular value thresholding operator). Given a matrix $A \in \mathbb{R}^{m \times n}$ of rank k with a singular value decomposition $A = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$, where $\sigma_1 \ge \dots \ge \sigma_k > 0$, and an integer $r$, we define $H_r(A) = \sum_{i=1}^{\min\{r, k\}} \sigma_i u_i v_i^\top$. In other words, $H_r(\cdot)$ is an operator that eliminates all but the top r highest singular values of a matrix.

Lemma A.5 (Weyl's inequality). For any matrix A and integer $i \ge 1$, let $\sigma_i(A)$ be the i-th largest singular value of A, or 0 if $i > \mathrm{rank}(A)$. Then, for any two matrices A, B and integers $i \ge 1$, $j \ge 1$:
$$
\sigma_{i+j-1}(A + B) \le \sigma_i(A) + \sigma_j(B) .
$$
A proof of the previous fact can be found e.g. in Fisk (1996).

Lemma A.6 ($H_r(\cdot)$ optimization problem). Let $A \in \mathbb{R}^{m \times n}$ be a rank-k matrix, $r \in [k]$ be an integer parameter, and $\lambda > 0$. Then $M = \frac{1}{\lambda} H_r(A)$ is an optimal solution to the following optimization problem:
$$
\max_{\mathrm{rank}(M) \le r} \ \langle A, M \rangle - \frac{\lambda}{2} \| M \|_F^2 . \qquad (6)
$$

Proof. Let $U \Sigma V^\top = \sum_i \Sigma_{ii} U_i V_i^\top$ be a singular value decomposition of A. We note that (6) is equivalent to the minimization problem
$$
\min_{\mathrm{rank}(M) \le r} \ \left\| M - \frac{1}{\lambda} A \right\|_F^2 , \qquad (7)
$$
since $\langle A, M \rangle - \frac{\lambda}{2} \| M \|_F^2 = \frac{1}{2\lambda} \| A \|_F^2 - \frac{\lambda}{2} \| M - \frac{1}{\lambda} A \|_F^2$.
On the other hand, by applying Weyl's inequality (Lemma A.5) with $j = r + 1$, for any M with $\mathrm{rank}(M) \le r$ we have $\sigma_{i+r}\!\left(\frac{1}{\lambda} A\right) \le \sigma_i\!\left(M - \frac{1}{\lambda} A\right) + \sigma_{r+1}(M) = \sigma_i\!\left(M - \frac{1}{\lambda} A\right)$, and so
$$
\left\| M - \frac{1}{\lambda} A \right\|_F^2 = \sum_{i \ge 1} \sigma_i^2\!\left( M - \frac{1}{\lambda} A \right) \ge \sum_{i \ge 1} \sigma_{i+r}^2\!\left( \frac{1}{\lambda} A \right) = \frac{1}{\lambda^2} \sum_{i=r+1}^{k} \sigma_i^2(A) ,
$$
where the last equality follows from the fact that $\mathrm{rank}(A) = k$ and $\mathrm{rank}(M) \le r$. This lower bound is attained by $M = \frac{1}{\lambda} H_r(A)$. Therefore, $M = \frac{1}{\lambda} H_r(A)$ minimizes (7) and thus maximizes (6).

A.2 PROOF OF THEOREM 2.1 (GREEDY)

We will start with the following simple lemma about the Frobenius norm of a sum of matrices with orthogonal columns or rows:

Lemma A.7. Let $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$, $X \in \mathbb{R}^{m \times r}$, $Y \in \mathbb{R}^{n \times r}$ be such that the columns of U are orthogonal to the columns of X, or the columns of V are orthogonal to the columns of Y. Then
$$
\| U V^\top + X Y^\top \|_F^2 = \| U V^\top \|_F^2 + \| X Y^\top \|_F^2 .
$$

Proof. If the columns of U are orthogonal to those of X, then $U^\top X = 0$, and if the columns of V are orthogonal to those of Y, then $Y^\top V = 0$. Therefore, in any case,
$$
\langle U V^\top, X Y^\top \rangle = \mathrm{tr}(V U^\top X Y^\top) = 0 ,
$$
and the claim follows by expanding $\| U V^\top + X Y^\top \|_F^2 = \| U V^\top \|_F^2 + 2 \langle U V^\top, X Y^\top \rangle + \| X Y^\top \|_F^2$.

Axiotis & Sviridenko (2020) showed that there is a family of instances in which the analogue of Local Search for sparse optimization requires a sparsity of $\Omega(s^* \kappa^2)$. Adapting these instances to the setting of rank-constrained convex optimization is less trivial, but we conjecture that it is possible, which would lead to a rank lower bound of $\Omega(r^* \kappa_{r+r^*}^2)$ for Local Search. We present the following lemma, which essentially states that sparse optimization lower bounds for Orthogonal Matching Pursuit (OMP, Pati et al. (1993)) (resp. Orthogonal Matching Pursuit with Replacement (OMPR, Jain et al. (2011))) in which the optimal sparse solution is also a global optimum immediately carry over (up to constants) to rank-constrained convex optimization lower bounds for Greedy (resp. Local Search).

Lemma A.9. Let $f : \mathbb{R}^n \to \mathbb{R}$ and let $x^* \in \mathbb{R}^n$ be an $s^*$-sparse vector that is also a global minimizer of f. Also, let f have restricted smoothness parameter β at sparsity level $s + s^*$ for some $s \ge s^*$ and restricted strong convexity parameter α at sparsity level $s + s^*$. Then we can define the rank-constrained problem (9), with $R : \mathbb{R}^{n \times n} \to \mathbb{R}$, where $\mathrm{diag}(A)$ is a vector containing the diagonal of A.
R has rank-restricted smoothness at rank $s + s^*$ at most 2β and rank-restricted strong convexity at rank $s + s^*$ at least α. Suppose that we run t iterations of OMP (resp. OMPR) starting from a solution x to get a solution x', and similarly run t iterations of Greedy (resp. Local Search) starting from the solution $A = \mathrm{diag}(x)$ (where $\mathrm{diag}(x)$ is a diagonal matrix with x on the diagonal) to get a solution A'. Then A' is diagonal and $\mathrm{diag}(A') = x'$. In other words, in this scenario OMP and Greedy (resp. OMPR and Local Search) are equivalent.

Proof. Note that $A^* = \mathrm{diag}(x^*)$ is an optimal solution of (9). Now, given any diagonal solution A of (9) such that $A = \mathrm{diag}(x)$, we claim that one step of either Greedy or Local Search keeps it diagonal. This is because $\nabla R(A)$ is itself diagonal. Therefore, the largest eigenvalue of $\nabla R(A)$ has corresponding eigenvector $1_i$ for some i, which implies that the rank-1 component which will be added is a multiple of $1_i 1_i^\top$. For the same reason, the rank-1 component removed by Local Search will be a multiple of $1_j 1_j^\top$ for some j. Therefore, running Greedy (resp. Local Search) on such an instance is identical to running OMP (resp. OMPR) on the diagonal.

Together with the lower bound instances of Foster et al. (2015) (in which the global minimum property holds), this immediately implies a rank lower bound of $\Omega(r^* \kappa_{r+r^*})$ for obtaining a solution with constant error for rank-constrained convex optimization. On the other hand, the lower bound instances of Axiotis & Sviridenko (2020) give a quadratic lower bound in κ for OMPR. The above lemma cannot be directly applied, since those sparse solutions are not global minima, but we conjecture that a similar proof will give a rank lower bound of $\Omega(r^* \kappa_{r+r^*}^2)$ for rank-constrained convex optimization with Local Search.

A.5 ADDENDUM TO SECTION 4

