Oblivious Sketching-based Central Path Method for Solving Linear Programming Problems

Abstract

In this work, we propose a sketching-based central path method for solving linear programs, whose running time matches the state-of-the-art results of Cohen et al. (2019b); Lee et al. (2019). Our method opens up the iterations of the central path method and deploys an "iterate and sketch" approach by introducing a new coordinate-wise embedding technique, which may be of independent interest. Compared to previous methods, Cohen et al. (2019b) enjoys feasibility while being non-oblivious, whereas Lee et al. (2019) is oblivious but infeasible and relies on dense sketching matrices such as subsampled randomized Hadamard/Fourier transform matrices. Our method enjoys the benefits of being both oblivious and feasible, and can use sparse sketching matrices Nelson & Nguyên (2013) to speed up the online matrix-vector multiplication. Our framework for solving LP naturally generalizes to a broader class of convex optimization problems, including empirical risk minimization.

1. Introduction

Linear programming is one of the fundamental models widely used in both theory and practice. Due to its simple and intuitive structure, it has been extensively applied in many fields such as economics Tintner (1955); Dorfman et al. (1987), operations research Delson & Shahidehpour (1992), compressed sensing Donoho (2006); Candes et al. (2006), medical studies Mangasarian et al. (1990; 1995), and adversarial deep learning Wong & Kolter (2018); Weng et al. (2018). The problem of solving linear programs has been studied since the 19th century Sierksma & Zwols (2015). Consider solving a general linear program in standard form min_{Ax=b, x≥0} c^T x of size A ∈ R^{d×n} without redundant constraints. For the generic case d = Ω(n) considered in this paper, the state-of-the-art results take a total running time of O*(n^ω + n^{2.5−α/2} + n^{2+1/6})foot_0 to obtain a solution of δ accuracy, i.e., the current matrix multiplication time Cohen et al. (2019b); Lee et al. (2019), where ω is the exponent of matrix multiplication, whose current value is roughly 2.373 Williams (2012); Le Gall (2014), and α is the dual exponent of matrix multiplication, whose current value is 0.31 Le Gall & Urrutia (2018). The breakthrough work due to Cohen, Lee, and Song Cohen et al. (2019b) improves the O*(n^{2.5}) running time that had stood since 1989 Vaidya (1989). For the current ω and α, the algorithm of Cohen et al. (2019b) takes O*(n^{2.373}) time. Among the current state-of-the-art results, the work Cohen et al. (2019b) involves a non-oblivious sampling technique, whose sampling set and size change along the iterations. This precludes moving expensive calculations to the preprocessing stage and also makes it harder to extend to other classical optimization problems. On the other hand, the work Lee et al. (2019) only maintains an infeasible update in each iteration and requires the use of dense sketching matrices, which ruin the potential sparsity structure of the original linear program. Thus, a natural question to ask is: Is there an oblivious and feasible algorithm for solving linear programs in fast running time (i.e., the current matrix multiplication time)?

In this work, we propose a method that is both oblivious and feasible (per iteration)foot_1 and solves linear programs in the same running time as the state of the art. The algorithm we propose is a sketching-based short-step central path method. The classical short-step method follows the central path in the interior of the feasible region: it decreases the complementarity gap uniformly by roughly a 1 − 1/√n factor in each iteration and takes O*(√n) iterations to converge, which results in O*(√n) × n^ω = O*(n^{ω+1/2}) total running time. The coordinate-wise embedding we introduce in this work is a distribution of matrices R ∈ R^{b_sketch×n} with b_sketch ≪ n such that, for any inner product g^T h between two n-dimensional vectors g, h ∈ R^n, with "high" probability g^T R^T R h approximates g^T h well. In the case of solving linear programs, we approximate the matrix-vector multiplication P h in each iteration by P R^T R h through an oblivious coordinate-wise embedding (OCE), such that the resulting random vector is close to the original one in each coordinate, i.e., (P R^T R h)_i ≈ (P h)_i for all i ∈ [n]. Combining this with lazy-update and low-rank-update techniques to maintain the query structure P R^T R h for any input vector h ∈ R^n, we can ensure the new random path remains close to the central path throughout the iterations. Therefore, our method decreases the average running time per iteration while keeping the same number of iterations. Furthermore, the sketching matrix R in our approach can be chosen in an oblivious way since it does not depend on the algorithm updates. Compared to the previous work Lee et al. (2019), our approximation form P R^T R h also admits a closed-form solution in each iteration for solving LP. Thus, our approach takes advantage of being both oblivious and feasible, compared to the other state-of-the-art results Cohen et al. (2019b); Lee et al. (2019).

We state our main result as follows:

Theorem 1.1 (Main result, informal). Given a linear program min_{Ax=b, x≥0} c^T x with no redundant constraints, let δ_lp denote the precision. Our algorithm takes O(n^{2.373} log(n/δ_lp)) time to solve this LP.

1.1. Related works

Linear programming. Linear programs have been studied for nearly a century. One of the first and most popular LP algorithms is the simplex algorithm Dantzig (1947). Although it works well on small practical problems, the simplex algorithm is known to take exponential time in the worst case, e.g., on the Klee-Minty cube Klee & Minty (1972). The first polynomial-time algorithm for solving LP is the ellipsoid method proposed by Khachiyan Khachiyan (1980). Although the ellipsoid method runs in polynomial time in theory, in practice it runs much slower than the simplex algorithm. Interior point methods Karmarkar (1984) have both polynomial running time in theory and fast, stable performance in practice. In the case d = Ω(n) considered in this work, Karmarkar's algorithm Karmarkar (1984) takes O*(n^{3.5}) running time, which was then improved to O*(n^3) in Renegar (1988); Vaidya (1987). In 1989, Vaidya further proposed an algorithm with a running time of O*(n^{2.5}). This result had not been improved until the recent work due to Cohen, Lee and Song Cohen et al. (2019b).

Sketching. The classical sketching methodology proposed by Clarkson & Woodruff (2013) is the so-called "sketch and solve" paradigm. The most standard and well-known applications include linear regression and low-rank approximation. The sketching method we deploy in this work is called "iterate and sketch" Song (2019). The major difference between classical "sketch and solve" and "iterate and sketch" is the following: the former applies the sketch only once, at the very beginning, to reduce the dimension of the problem, without modifying the solver; the latter opens up and modifies the solver by applying sketching techniques in each iteration. The idea of "iterate and sketch" has been applied to a number of problems, e.g., computing the John ellipsoid, the Newton method, tensor decomposition, and training deep neural networks.

Empirical risk minimization. The empirical risk minimization (ERM) problem is a fundamental question in statistical machine learning.
Extensive literature has been devoted to this topic Nesterov (1983); Vapnik (1992); Nesterov (1998); Polyak & Juditsky (1992); Nemirovski et al. (2009); Nesterov (2013); Vapnik (2013). First-order methods and a series of accelerated gradient descent algorithms for ERM are well developed and studied Jin et al. (2018); Johnson & Zhang (2013); Nesterov & Stich (2017); Xiao & Zhang (2014); Allen-Zhu (2018). These rates depend polynomially on the smoothness/strong convexity of the objective in order to achieve a log(1/ε) dependence on the error parameter ε.

Notations. For a positive integer n, we use [n] to denote the set {1, 2, ..., n}. For vectors x, z ∈ R^n and a parameter ε ∈ (0, 1), we use x ≈_ε z to denote (1 − ε) z_i ≤ x_i ≤ (1 + ε) z_i for all i ∈ [n]. For a scalar t, we use a ≈_ε t to denote (1 − ε) t ≤ a_i ≤ (1 + ε) t for all i ∈ [n]. Given diagonal matrices X = diag(x) ∈ R^{n×n} and S = diag(s) ∈ R^{n×n}, we use X/S to denote the diagonal matrix with (X/S)_{i,i} = x_i/s_i for all i ∈ [n].

2. Technique overview

In this section, we discuss the key ideas of our approach based on the central path method.

2.1. Short Step Central Path Method

Consider the following standard primal and dual linear programs:

min_{Ax=b, x≥0} c^T x  (primal)    and    max_{A^T y + s = c, s≥0} b^T y  (dual),

where A ∈ R^{d×n} is full rank with d = O(n). Then (x, y, s) is an optimal solution if and only if it satisfies the following optimality conditions Vanderbei et al. (2015):

Ax = b, x ≥ 0  (primal feasibility)
A^T y + s = c, s ≥ 0  (dual feasibility)
x_i s_i = 0 for all i  (complementary slackness)

The classical interior point method finds an optimal solution by following the central path in the interior of the feasible region, which is defined as the set of tuples (x, y, s, t) satisfying

Ax = b, x > 0,
A^T y + s = c, s > 0,    (1)
x_i s_i = t for all i,

where t > 0 is called the complementarity gap. It has been shown that we can obtain an initial point on the central path with t = 1 according to Ye et al. (1994). Then, in each iteration, the classical algorithm decreases the complementarity gap uniformly from t to ηt with η < 1 and solves Eq. (1). As t approaches 0, the central path converges to an optimal solution. The short-step central path method approximately solves Eq. (1) through the following linear system:

X δ_s + S δ_x = δ_µ,    A δ_x = 0,    A^T δ_y + δ_s = 0,    (2)

where X = diag(x), S = diag(s), and we update the solution by x ← x + δ_x, s ← s + δ_s and y ← y + δ_y. Denote the actual complementarity gap µ ∈ R^n under Eq. (2) by µ_i = x_i s_i for i ∈ [n]. Then Eq. (2) maintains the feasibility conditions while approximately moving the gap from µ to µ + δ_µ. As long as the actual complementarity gap µ stays close to the aiming complementarity gap t throughout the algorithm, the actual gap µ converges to 0 as t goes to 0, which leads to an optimal solution. To solve the linear system (2), note that when A is full rank, it has a unique solution explicitly given by

δ_x = (X/√(XS)) (I − P) (1/√(XS)) δ_µ    and    δ_s = (S/√(XS)) P (1/√(XS)) δ_µ,    (3)

where P = √(X/S) A^T (A (X/S) A^T)^{-1} A √(X/S) is an orthogonal projection matrix.
Vaidya (1989) shows that we can choose η to be roughly 1 − 1/√n, so that the algorithm converges in O*(√n) iterations. Therefore, the total running time of solving LP via the explicit solution Eq. (3) is O*(n^{ω+1/2}).
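As a sanity check on the closed-form step (3), the following minimal NumPy sketch builds the projection P on a small random instance and verifies that the resulting (δ_x, δ_s) solve the linear system (2): the gap equation X δ_s + S δ_x = δ_µ holds and δ_x stays in the null space of A. All names and sizes here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 8
A = rng.standard_normal((d, n))
x = rng.uniform(0.5, 2.0, n)          # strictly positive primal iterate
s = rng.uniform(0.5, 2.0, n)          # strictly positive dual slack
delta_mu = 0.01 * rng.standard_normal(n)

w = x / s                              # diagonal of W = X S^{-1}
sw = np.sqrt(w)
# Orthogonal projection P = sqrt(W) A^T (A W A^T)^{-1} A sqrt(W)
P = (sw[:, None] * A.T) @ np.linalg.solve(A * w @ A.T, A * sw[None, :])

inv_sqrt_xs = 1.0 / np.sqrt(x * s)
# X (XS)^{-1/2} = sqrt(W) and S (XS)^{-1/2} = sqrt(W)^{-1}
delta_x = sw * ((np.eye(n) - P) @ (inv_sqrt_xs * delta_mu))
delta_s = (1.0 / sw) * (P @ (inv_sqrt_xs * delta_mu))

# The step solves X delta_s + S delta_x = delta_mu and keeps A delta_x = 0.
assert np.allclose(x * delta_s + s * delta_x, delta_mu)
assert np.allclose(A @ delta_x, 0)
```

The two assertions mirror the first two equations of system (2); the identity A δ_x = 0 follows because A √W P = A √W, so the (I − P) factor annihilates the row space of A √W.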

2.2. Sketching-based Central Path Method

In the following subsections, we discuss our sketching-based central path method. In Subsection 2.3, we introduce the coordinate-wise embedding (CE) technique, discuss the difference between CE and classical sketching techniques such as the Johnson-Lindenstrauss (JL) lemma and subspace embedding (SE), and discuss the results of applying common sketching matrices as CEs. In Subsection 2.4, we explain why our sketching-based central path method speeds up the computation. In Subsection 2.5, we explain why our sketching-based central path method is feasible and oblivious. In Subsection 2.6, we discuss the projection maintenance needed for the algorithm updates.

Algorithm 1 Main algorithm (simplified)
5:  t ← 1    ▷ Initialize the aiming gap t
6:  while t > δ_lp^2/(32 n^3) do    ▷ Stop once the precision is good
7:      t_new ← (1 − ε/(3√n)) · t    ▷ Decrease the aiming gap by roughly a 1 − 1/√n factor in each iteration
8:      µ ← x s    ▷ Actual gap
9:      δ_µ ← (t_new/t − 1) · x s − (ε/2) · t_new · ∇Φ_λ(µ/t − 1)/‖∇Φ_λ(µ/t − 1)‖_2
        ▷ Here Φ_λ(r) := Σ_{i=1}^n cosh(λ r_i) is a potential function characterizing the ℓ∞ closeness between the actual path µ and the aiming path t; the Φ_λ term in the update helps ensure µ ≈_{0.1} t.
10:     (
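Line 9 of Algorithm 1 combines a rescaling term with a normalized gradient step on the potential Φ_λ. The sketch below implements that step on a toy instance; note that the extraction dropped some symbols from the pseudocode, so the exact step sizes (ε/(3√n) and ε/2) are assumptions here, and all concrete numbers are illustrative.

```python
import numpy as np

def potential(r, lam):
    # Phi_lambda(r) = sum_i cosh(lam * r_i); small iff r is small in l_infinity
    return np.sum(np.cosh(lam * r))

def potential_grad(r, lam):
    return lam * np.sinh(lam * r)

rng = np.random.default_rng(1)
n, lam, eps = 16, 5.0, 0.1
t = 1.0
t_new = (1 - eps / (3 * np.sqrt(n))) * t       # line 7: shrink the aiming gap
mu = t * (1 + 0.03 * rng.standard_normal(n))   # actual gap, close to the aiming gap t

g = potential_grad(mu / t - 1, lam)
delta_mu = (t_new / t - 1) * mu - (eps / 2) * t_new * g / np.linalg.norm(g)

# The first term rescales mu toward the new gap t_new; the second is a small
# normalized gradient step on Phi_lambda pulling mu back toward the central path.
assert potential(mu / t - 1, lam) >= n          # cosh >= 1, so Phi_lambda >= n
assert np.all(np.sign(g) == np.sign(mu / t - 1))  # sinh preserves sign
mu_next = mu + delta_mu
assert np.all(mu_next > 0)                      # gaps stay positive
assert np.linalg.norm(mu_next / t_new - 1, np.inf) < 0.2
```

The sign check makes the intuition concrete: the gradient of Φ_λ points away from the central path in every coordinate where µ_i deviates, so subtracting it pushes each µ_i back toward t.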

2.3. Coordinate-wise Embedding

To speed up the classical central path method, we introduce the coordinate-wise embedding (CE) as follows:

Definition 2.1 ((α, β, δ)-coordinate-wise embedding (CE)). Given parameters α, β ∈ R and δ ∈ (0, 1), we say a randomized matrix R ∈ R^{b_sketch×n} with distribution Π satisfies the (α, β, δ)-coordinate-wise embedding property if for any fixed vectors g, h ∈ R^n, we have
1. E_{R∼Π}[g^T R^T R h] = g^T h,
2. E_{R∼Π}[(g^T R^T R h)^2] ≤ (g^T h)^2 + (α/b_sketch) ‖g‖_2^2 ‖h‖_2^2,
3. Pr_{R∼Π}[ |g^T R^T R h − g^T h| ≥ (β/√b_sketch) ‖g‖_2 ‖h‖_2 ] ≤ δ.

We remark that the (α, β, δ)-coordinate-wise embedding proposed here differs from the conventional Johnson-Lindenstrauss lemma Johnson & Lindenstrauss (1984) and subspace embedding Sarlós (2006) of the classical literature. The coordinate-wise embedding of Definition 2.1 only concerns a fixed vector pair g, h ∈ R^n, while the Johnson-Lindenstrauss embedding stated below concerns a finite set of points:

Definition 2.2 (Johnson-Lindenstrauss embedding (JL) Johnson & Lindenstrauss (1984)). Given 0 < ε < 1 and a finite point set X in R^n with |X| = m, we say a randomized matrix R ∈ R^{b×n} satisfies the Johnson-Lindenstrauss property if (1 − ε) ‖g‖_2^2 ≤ ‖Rg‖_2^2 ≤ (1 + ε) ‖g‖_2^2 for all g ∈ X.

The JL guarantee holds for a finite set of points in R^n, whereas a subspace embedding holds for all vectors in a subspace:

Definition 2.3 (Subspace embedding (SE) Sarlós (2006)). Given 0 < ε < 1 and a matrix A ∈ R^{n×d}, we say a randomized matrix R ∈ R^{b×n} is a (1 ± ε) ℓ_2-subspace embedding for the column space of A if ‖RAx‖_2^2 = (1 ± ε) ‖Ax‖_2^2 for all x ∈ R^d.

The subspace embedding property holds for all vectors in the column space of A.
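The three properties of Definition 2.1 can be checked empirically. The sketch below uses a random Gaussian matrix with entries N(0, 1/b_sketch), for which E[R^T R] = I_n, and verifies that the estimator g^T R^T R h is unbiased with deviations on the scale ‖g‖‖h‖/√b_sketch. Sizes and trial counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, b = 1024, 64
g = rng.standard_normal(n)
h = rng.standard_normal(n)
exact = g @ h

# Random Gaussian sketch: entries N(0, 1/b), so E[R^T R] = I_n (property 1).
trials = 200
ests = np.empty(trials)
for i in range(trials):
    R = rng.standard_normal((b, n)) / np.sqrt(b)
    ests[i] = (R @ g) @ (R @ h)

# Property 2/3: deviations concentrate at scale ||g||_2 ||h||_2 / sqrt(b_sketch).
scale = np.linalg.norm(g) * np.linalg.norm(h) / np.sqrt(b)
assert abs(ests.mean() - exact) < 0.5 * scale          # unbiased on average
assert np.median(np.abs(ests - exact)) < 3 * scale     # typical deviation ~ scale
```

Note that nothing about R depends on g or h, which is exactly the obliviousness exploited later: the same precomputed sketch works for every query vector the algorithm produces.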

Several well-known sketching matrices

To further concretize our sketching approach, we discuss the following commonly used sketching matrices and their corresponding properties when acting as coordinate-wise embeddings and when used to solve LP. Considering an oblivious regime, where the sketching size b_sketch is fixed, we have:

Lemma 2.4 (Oblivious coordinate-wise embedding properties). The sketching matrices defined above satisfy the (α, β, δ)-coordinate-wise embedding property as follows:


Sketching matrix  | α    | β                                   | LP? (Left) | LP? (Right)
Random Gaussian   | O(1) | O(log^{1.5}(n/δ))                   | Yes        | Yes
SRHT              | O(1) | O(log^{1.5}(n/δ))                   | Yes        | Yes
AMS               | O(1) | O(log^{1.5}(n/δ))                   | Yes        | Yes
Count-sketch      | O(1) | O(√b_sketch · log(1/δ)) or O(1/√δ)  | No         | No
Sparse embedding  | O(1) | O(√(b_sketch/s) · log^{1.5}(n/δ))   | No†        | Yes*
Uniform sampling  | O(n) | O(n/√b_sketch)                      | No         | No

Table 2: Summary for different sketching matrices. *A sparse embedding sketching matrix can be used in the LP algorithm when it is applied on the right and s = Ω(log^2(n/δ)). †When sketching on the left (as in Lee et al. (2019)), additional algorithmic designs are needed to make the algorithm feasible (see Section F.1 for more discussion), and the error of the feasibility part cannot be bounded unless s = Ω(b_sketch).

Remark 2.5. The approach in Cohen et al. (2019b) behaves similarly to applying a uniform sampling matrix in our sketching framework, which does not work in an oblivious setting; this is why Cohen et al. (2019b) needs to modify its sampling size in each iteration. In general, to apply sketching in an oblivious way, we observe that the sketching matrix should be relatively dense so as to concentrate well around its expectation, so that we can control the extra perturbation introduced by the random sketching when solving linear programming problems.
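The α column of Table 2 and Remark 2.5 can be illustrated numerically: on a "spiky" vector, a uniform sampling sketch either misses the heavy coordinate entirely or blows it up by n/b_sketch, giving variance Θ(n/b_sketch), while a Gaussian sketch stays concentrated. The constructions and constants below are illustrative, not the paper's exact instantiation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, trials = 1024, 32, 500
e1 = np.zeros(n); e1[0] = 1.0          # spiky vector: all mass on one coordinate

def gaussian_est():
    R = rng.standard_normal((b, n)) / np.sqrt(b)
    v = R @ e1
    return v @ v                        # estimates e1^T e1 = 1

def uniform_est():
    idx = rng.choice(n, size=b, replace=False)   # b of n coordinates, no replacement
    sign = rng.choice([-1.0, 1.0], size=n)
    v = np.sqrt(n / b) * sign[idx] * e1[idx]     # rows of sqrt(n/b) * S D applied to e1
    return v @ v                        # equals n/b if coordinate 1 is sampled, else 0

g_err = np.mean([(gaussian_est() - 1) ** 2 for _ in range(trials)])
u_err = np.mean([(uniform_est() - 1) ** 2 for _ in range(trials)])

# Gaussian (alpha = O(1)): mean squared error ~ 2/b.  Uniform sampling
# (alpha = O(n)): mean squared error ~ n/b, orders of magnitude larger here.
assert g_err < 0.5
assert u_err > 3.0
```

This is the quantitative sense in which dense oblivious sketches "concentrate around their expectation" while uniform sampling needs data-dependent (non-oblivious) probabilities to behave well.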

2.4. Speeding up central path method through OCE

To speed up the classical central path method, we randomize the computation of Eq. (3) as

δ_x = (X/√(XS)) (I − P) R^T R (1/√(XS)) δ_µ    and    δ_s = (S/√(XS)) P R^T R (1/√(XS)) δ_µ,    (4)

where the random sketching matrix R ∈ R^{b_sketch×n} satisfies the (α, β, δ)-coordinate-wise embedding property with α = O(1), β = O(√b_sketch) and any failure probability δ ∈ (0, 1). The coordinate-wise embedding property of Definition 2.1 ensures that the randomized update Eq. (4) concentrates well around the original Eq. (3), which implies the new randomized path µ will still stay near the aiming path t during the algorithm updates. Therefore, given the same decreasing rate of the aiming path t as before, we are able to prove µ ≈_{0.1} t throughout the iterations. As both converge to zero, we still obtain an optimal solution in O*(√n) iterations. In terms of running time, note that Eq. (4) reduces the dimension of the previous matrix-vector multiplication from n to b_sketch. Assuming we can maintain the sketched projection matrix P R^T in an efficient manner, the calculation in Eq. (4) reduces to (P R^T) · u for some vector u ∈ R^{b_sketch}, which is a multiplication between an n × b_sketch matrix and a b_sketch-dimensional vector and costs O(n b_sketch) running time. Choosing b_sketch = O*(√n), we speed up the updates. We summarize our approach in Algorithms 1 and 2. We discuss how to maintain the projection in Section 2.6.
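The dimension-reduction argument can be seen directly in code: once the n × b_sketch matrix P R^T is maintained, each query P R^T R h costs O(n · b_sketch) instead of the O(n^2) of P h, and the result stays close to P h in every coordinate. The instance below is a minimal illustration with made-up sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, b = 16, 256, 64
A = rng.standard_normal((d, n))
w = rng.uniform(0.5, 2.0, n)                # w_i = x_i / s_i
sw = np.sqrt(w)
# Orthogonal projection P = sqrt(W) A^T (A W A^T)^{-1} A sqrt(W)
P = (sw[:, None] * A.T) @ np.linalg.solve(A * w @ A.T, A * sw[None, :])

R = rng.standard_normal((b, n)) / np.sqrt(b)
PRt = P @ R.T                               # maintained once: n x b_sketch

h = rng.standard_normal(n)
exact = P @ h                               # O(n^2) work per iteration
fast = PRt @ (R @ h)                        # O(n * b_sketch) work per iteration

# Coordinate-wise closeness: |(P R^T R h)_i - (P h)_i| ~ ||P_i|| ||h|| / sqrt(b)
bound = np.max(np.linalg.norm(P, axis=1)) * np.linalg.norm(h)
assert np.max(np.abs(fast - exact)) < bound
```

With b_sketch = O*(√n) this per-iteration cost becomes O*(n^{1.5}), which over O*(√n) iterations stays within the O*(n^2) query budget of Theorem 2.7.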

2.5. Feasible central path equation via sketching

To explain the strength of our approach, we discuss its feasibility and obliviousness advantages over the past state-of-the-art results. The new update Eq. (4) can be viewed as an exact solution of the following linear system:

X δ_s + S δ_x = δ̃_µ,    A δ_x = 0,    A^T δ_y + δ_s = 0,    (5)

where δ̃_µ = √(XS) R^T R (1/√(XS)) δ_µ. Therefore, our approach can also be viewed as updating the complementarity gaps within a subspace in each iteration, instead of decreasing them uniformly. Note that in each iteration, our update solves the new linear system (5) exactly. Compared to the state-of-the-art approach Lee et al. (2019), which constructs a solution that solves its linear system inexactly, the feasibility of our approach spares us a complicated analysis. Our approach is also able to use a sparse embedding matrix, which avoids ruining the potential sparsity structure of the original linear program, in contrast to the dense sketching matrices used in Lee et al. (2019). On the other hand, our method is oblivious since the choice of the sketching matrix R ∈ R^{b_sketch×n} does not depend on the algorithm updates, which means we can pick the sketching matrices in the preprocessing stage. In contrast, for the state-of-the-art approach Cohen et al. (2019b), the sampling probability depends on the algorithm updates and needs to be computed on the fly.
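A key point above is that feasibility of the sketched step does not depend on the randomness at all: for any realization of R, the update (4) solves system (5) exactly. The sketch below checks both identities numerically on a small made-up instance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, b = 4, 32, 8
A = rng.standard_normal((d, n))
x = rng.uniform(0.5, 2.0, n)
s = rng.uniform(0.5, 2.0, n)
delta_mu = 0.01 * rng.standard_normal(n)

w = x / s
sw = np.sqrt(w)
P = (sw[:, None] * A.T) @ np.linalg.solve(A * w @ A.T, A * sw[None, :])
R = rng.standard_normal((b, n)) / np.sqrt(b)   # the argument works for ANY sketch R

v = R.T @ (R @ (delta_mu / np.sqrt(x * s)))    # v = R^T R (XS)^{-1/2} delta_mu
delta_x = sw * ((np.eye(n) - P) @ v)           # X (XS)^{-1/2} = sqrt(W)
delta_s = (1.0 / sw) * (P @ v)                 # S (XS)^{-1/2} = sqrt(W)^{-1}

# Feasibility holds exactly, for every realization of R:
assert np.allclose(A @ delta_x, 0)
# ... and (delta_x, delta_s) solve the modified gap equation of system (5):
delta_mu_tilde = np.sqrt(x * s) * v            # sqrt(XS) R^T R (XS)^{-1/2} delta_mu
assert np.allclose(x * delta_s + s * delta_x, delta_mu_tilde)
```

The randomness only perturbs *which* direction δ̃_µ the gap moves in, never the primal/dual feasibility, because A √W (I − P) = 0 identically.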

2.6. Projection maintenance

In this section, we discuss our approach to the second computational bottleneck, i.e., how to maintain the sketched projection P R^T ∈ R^{n×b_sketch} in an efficient way, where P ∈ R^{n×n} is the orthogonal projection matrix defined in Eq. (3) and R ∈ R^{b_sketch×n} is a random sketching matrix with an appropriate (α, β, δ)-coordinate-wise embedding property. Let W := diag(w) ∈ R^{n×n} denote the diagonal matrix with w_i = x_i/s_i. Then we have

P := √W A^T (A W A^T)^{-1} A √W ∈ R^{n×n}.

Therefore, our final goal of implementing Eq. (4) reduces to maintaining a query structure which outputs

P R^T R h = (P R^T) · (R · h) = (P R^T) · u,    (7)

where u ∈ R^{b_sketch}.

Under review as a conference paper at ICLR 2021

To achieve this, we make an observation similar to Cohen et al. (2019b): W does not vary much between two iterations under the sketching approach, as shown in the following lemma:

Lemma 2.6 (Change of W). Let 0 < ε < 1/(40000 log n). Let w_i and w_i^new denote the values of x_i/s_i in two consecutive iterations. Then we have

Σ_{i=1}^n (E[ln w_i^new] − ln w_i)^2 ≤ 64 ε^2    and    Σ_{i=1}^n (Var[ln w_i^new])^2 ≤ 1000 ε^2.

The above observation motivates us to exploit lazy updates, since we only need to maintain the projection approximately and w_i changes little per iteration. We discuss two extreme scenarios to illustrate the core ideas: 1) w changes uniformly across all coordinates, and 2) w changes in only a few coordinates. In the first case, we use the idea of lazy updates: Lemma 2.6 implies the change of w between two iterations is roughly w_i^new ≈ (1 ± 1/√n) w_i. In the second case, we use the idea of low-rank updates. Instead of updating P R^T in each iteration, we directly compute P R^T u using the Woodbury matrix identity. Since w changes in only a few coordinates, we only need to compute the inverse of a small matrix instead of the original n × n matrix. Moreover, computing P R^T u instead of P R^T further speeds up the computation because we only need matrix-vector multiplications. As a result, we can output P R^T u in O*(n b_sketch) time. Recall that we choose b_sketch = O*(n^{1/2}); therefore, the running time in this case is O*(n^2), which is also within our budget. For general cases, we combine the above techniques and obtain the following theorem:

Theorem 2.7 (Projection maintenance). Given a parameter a ∈ (0, α)foot_3 and sketching matrices R ∈ R^{n^b×n} with b ∈ [0, 1], we can approximately maintain the projection through two operations:
1. Update(w): output a vector ṽ such that (1 − ε_mp) ṽ_i ≤ w_i ≤ (1 + ε_mp) ṽ_i for all i.
2. Query(h): output √Ṽ A^T (A Ṽ A^T)^{-1} A √Ṽ (R^T)_{*,l} R_{l,*} h for the ṽ output by the last call to Update.
The data structure takes n^2 d^{ω−2} time to initialize, each call of Query(h) takes O*(n^{1+b} + n^{1+a}) time, and the amortized expected time per call of Update(w) is O*(n^{ω−1/2} + n^{2−a/2}).

Note that our approach maintains the projection at the point ṽ instead of w = x/s, where x̄_i = x_i √(ṽ_i/w_i) ≈_{0.1} x_i and s̄_i = s_i √(w_i/ṽ_i) ≈_{0.1} s_i. Equivalently speaking, instead of solving Eq. (5), we are solving

X̄ δ_s + S̄ δ_x = δ̄_µ,    A δ_x = 0,    A^T δ_y + δ_s = 0,

where δ̄_µ = √(X̄S̄) R^T R (1/√(X̄S̄)) δ_µ, as shown in Algorithm 2. The final running time of our algorithm can be bounded by O*(n^ω).

Corollary 2.8 (Extension to ERM). Our approach for solving LP naturally generalizes to other convex optimization problems of the following form Lee et al. (2019), including empirical risk minimization:

min_x Σ_i f_i(A_i x + b_i),

where each f_i is a convex function on R^{n_i} and n = Σ_i n_i. Our algorithm outputs the solution in time O*(n^{2.373} log(n/δ)), where δ is the precision parameter.
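The low-rank-update case rests on the Woodbury matrix identity: if w changes in only k coordinates, the maintained inverse of A W A^T can be corrected with a k × k solve instead of a fresh d × d inversion. The sketch below verifies the identity on an illustrative instance; the update pattern and sizes are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 32, 256, 4
A = rng.standard_normal((d, n))
w = rng.uniform(0.5, 2.0, n)

M_inv = np.linalg.inv(A * w @ A.T)        # maintained inverse of M = A W A^T

# Suppose w changes in only k coordinates:
idx = np.array([3, 17, 50, 200])
dw = rng.uniform(0.05, 0.2, k)
U = A[:, idx]                              # d x k: the affected columns of A
# Woodbury: (M + U diag(dw) U^T)^{-1}
#   = M^{-1} - M^{-1} U (diag(1/dw) + U^T M^{-1} U)^{-1} U^T M^{-1}
mid = np.linalg.inv(np.diag(1.0 / dw) + U.T @ M_inv @ U)
M_inv_new = M_inv - M_inv @ U @ mid @ U.T @ M_inv   # O(d^2 k) instead of O(d^omega)

w_new = w.copy(); w_new[idx] += dw
assert np.allclose(M_inv_new, np.linalg.inv(A * w_new @ A.T))
```

In the data structure of Theorem 2.7, only the matrix-vector product P R^T u is needed, so the Woodbury correction is applied to a vector rather than materialized, which is where the O*(n b_sketch) query cost comes from.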



Footnotes.
- We use O*(·) to hide n^{o(1)} and log^{O(1)}(1/δ) factors.
- In each iteration, we approximate the central path by solving a linear system. Our approach constructs a randomized oblivious system that can be solved exactly, while previous work Cohen et al. (2019b) constructs a non-oblivious one, and Lee et al. (2019) does not solve the system exactly.
- In this case, we require log n to be an integer.
- α is the dual exponent of matrix multiplication, whose current value is roughly 0.31 Le Gall & Urrutia (2018).



Applications of "sketch and solve" include linear regression Clarkson & Woodruff (2013); Nelson & Nguyên (2013); Andoni et al. (2018); Clarkson et al. (2019); Song et al. (2019a) and low-rank approximation Clarkson & Woodruff (2013); Nelson & Nguyên (2013); Boutsidis & Woodruff (2014); Clarkson & Woodruff (2015b;a); Razenshteyn et al. (2016); Song et al. (2017; 2019b;c). It further generalizes to subspace Wang & Woodruff (2019); Li et al. (2020a), positive semi-definite matrices Clarkson & Woodruff (2017), total least squares regression Diao et al. (2019), quantile regression Li et al. (2020b), tensor regression Li et al. (2017); Diao et al. (2018), and tensor decomposition Song et al. (2019d).

Applications of "iterate and sketch" include computing the John ellipsoid Cohen et al. (2019a), the Newton method Pilanci & Wainwright (2016; 2017), tensor decomposition Wang et al. (2015); Song et al. (2016), and training deep neural networks Brand et al. (2020).

1: procedure Main(A, b, c, δ_lp)    ▷ Theorem D.1
2:   Modify the linear program and obtain an initial x and s according to Ye et al. (1994)
3:   Ensure the initial complementarity gap starts with x_i s_i = 1
4:   Initialize: sketching size b_sketch = O*(√n), parameter ε = O*(1), projection maintenance data structure mp

Random Gaussian matrix: all entries are sampled from N(0, 1/b_sketch) independently.
SRHT matrix Lu et al. (2013): let R = √(n/b_sketch) · S H D, where S ∈ R^{b_sketch×n} is a random matrix whose rows are b_sketch uniform samples (without replacement) from the standard basis of R^n, H ∈ R^{n×n} is a normalized Walsh-Hadamard matrix, and D ∈ R^{n×n} is a diagonal matrix whose diagonal entries are i.i.d. Rademacher random variables.
AMS sketch matrix Alon et al. (1999): let R_{i,j} = h_i(j), where h_1, h_2, ..., h_{b_sketch} are b_sketch random hash functions picked from a random hash family H = {h : [n] → {−1/√b_sketch, +1/√b_sketch}}.
Count-sketch matrix Charikar et al. (2002): let R_{h(i),i} = σ(i) for all i ∈ [n] and all other entries be zero, where h : [n] → [b_sketch] and σ : [n] → {−1, +1} are random hash functions.
Sparse embedding matrix Nelson & Nguyên (2013): let R_{(j−1)b_sketch/s + h(i,j), i} = σ(i,j)/√s for all (i,j) ∈ [n] × [s] and all other entries be zero, where h : [n] × [s] → [b_sketch/s] and σ : [n] × [s] → {−1, +1} are random hash functions.
Uniform sampling matrix: let R = √(n/b_sketch) · S D, where S ∈ R^{b_sketch×n} is a random matrix whose rows are b_sketch uniform samples (without replacement) from the standard basis of R^n, and D ∈ R^{n×n} is a diagonal matrix whose diagonal entries are i.i.d. Rademacher random variables.
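The count-sketch construction above is especially simple: one nonzero per column, so applying R to a vector costs time proportional to the vector's number of nonzeros. A minimal construction (sizes illustrative) and a check of the defining identity ‖Rg‖² = Σ_i g_i² + 2 Σ_{h(i)=h(j), i<j} σ(i)σ(j) g_i g_j:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b = 64, 8
h = rng.integers(0, b, size=n)            # random hash h : [n] -> [b_sketch]
sigma = rng.choice([-1.0, 1.0], size=n)   # random signs sigma : [n] -> {-1, +1}

R = np.zeros((b, n))
R[h, np.arange(n)] = sigma                # R_{h(i), i} = sigma(i), rest zero

# Each column has exactly one nonzero, so R g costs O(nnz(g)) time.
assert np.all((R != 0).sum(axis=0) == 1)

g = rng.standard_normal(n)
collisions = 2 * sum(sigma[i] * sigma[j] * g[i] * g[j]
                     for i in range(n) for j in range(i + 1, n)
                     if h[i] == h[j])
# ||R g||^2 = ||g||^2 plus signed collision terms, which average to zero over sigma.
assert np.allclose(np.linalg.norm(R @ g) ** 2, np.sum(g ** 2) + collisions)
```

The signed collision terms are exactly what makes count-sketch unbiased (property 1 of Definition 2.1) but heavy-tailed, matching its weaker β bound in Table 2 compared to the dense sketches.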

Summary of the guarantees of different embeddings. The three embeddings give the ℓ_2-norm guarantee for different numbers of vectors. Coordinate-wise embedding additionally guarantees that the embedding is unbiased and that the variance is bounded. See Section G for more details.

Therefore, the w_i's will vary by more than a constant factor, and possibly ruin the previous ℓ∞ closeness, only after O*(√n) iterations. In this case, we only need to update the matrix P R^T once every O*(√n) iterations and can stay "lazy" the rest of the time. Since the algorithm finishes in O*(√n) iterations, we only need to update the matrix P R^T O*(1) times, for a total running time of O*(n^ω).

