A New Accelerated Gradient Method Inspired by Continuous-Time Perspective

Abstract

Nesterov's accelerated method is widely used in machine learning problems, including deep learning. To give more insight into the acceleration phenomenon, an ordinary differential equation was derived from Nesterov's accelerated method by taking step sizes approaching zero, and the relationship between Nesterov's method and this differential equation remains of research interest. In this work, we give the precise order at which the iterates of Nesterov's accelerated method converge to the solution of the derived differential equation as the step size goes to zero. We then present a new accelerated method with a higher order. The new method is more stable than the original method for large step sizes and converges faster. We further apply the new method to the matrix completion problem and show its better performance through numerical experiments.

1. Introduction

Optimization is a core component of statistics and machine learning problems. Recently, gradient-based algorithms have been widely used in such optimization problems due to their simplicity and efficiency in large-scale settings. For solving the convex optimization problem $\min_{x \in \mathbb{R}^d} F(x)$, where $F(x)$ is convex and sufficiently smooth, a classical first-order method is gradient descent. We assume that $f(x) = \nabla F(x)$ satisfies the $L$-Lipschitz condition, that is, there exists a constant $L$ such that $\|f(x) - f(y)\| \le L\|x - y\|$ for all $x, y$. Under these conditions, gradient descent achieves a convergence rate of $O(n^{-1})$, i.e., $|F(x_n) - F(x^*)|$ decreases to zero at a rate of $O(n^{-1})$, where $x_n$ denotes the $n$th iterate and $x^*$ denotes the minimum point of $F(x)$ in $\mathbb{R}^d$. Nesterov's accelerated method (Nesterov, 1983) is a more efficient first-order algorithm than gradient descent, of which we will use the following form: starting with $x_0 = x_1$,

$$y_n = x_n + \frac{n-3}{n}(x_n - x_{n-1}), \qquad x_{n+1} = y_n - s f(y_n) \tag{1.1}$$

for $n \ge 1$. It has been shown that under the abovementioned conditions, Nesterov's accelerated method converges at a rate of $O(n^{-2})$. Accelerated gradient methods have been successful in training deep and recurrent neural networks (Sutskever et al., 2013) and are widely used in machine learning problems to avoid sophisticated second-order methods (Cotter et al., 2011; Hu et al., 2009; Ji & Ye, 2009). To provide more theoretical understanding, an important research topic concerning Nesterov's accelerated method is to find an explanation of the acceleration. On this topic, Nesterov's method was studied via a continuous-time perspective (Su et al., 2014): the authors considered a curve $x(t)$, introduced the ansatz $x_n \approx x(n\sqrt{s})$, and substituted it into (1.1). Letting $s \to 0$, they obtained the following differential equation:

$$\ddot{x} + \frac{3}{t}\dot{x} + f(x) = 0. \tag{1.2}$$

The differential equation was used as a tool for analyzing and generalizing Nesterov's scheme.
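For concreteness, scheme (1.1) can be sketched in a few lines of Python; the quadratic test objective below is our own illustration, not an example from the paper.

```python
import numpy as np

def nesterov(f_grad, x0, s, n_iters):
    """Nesterov's accelerated method in the form (1.1):
    y_n = x_n + (n-3)/n * (x_n - x_{n-1});  x_{n+1} = y_n - s * f(y_n)."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()  # starting with x_0 = x_1
    for n in range(1, n_iters + 1):
        y = x + (n - 3) / n * (x - x_prev)
        x_prev, x = x, y - s * f_grad(y)
    return x

# Minimize F(x) = ||x||^2 / 2, whose gradient is f(x) = x and minimizer is 0.
x_star = nesterov(lambda x: x, np.array([1.0, -2.0]), s=0.1, n_iters=1000)
```

Note that the momentum coefficient $(n-3)/n$ is negative for the first two iterations; this is part of the scheme as stated, not a bug.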
Furthermore, this idea has been studied from different directions. A class of accelerated methods has been generated in continuous time (Wibisono et al., 2016). ODE (1.2) can also be discretized directly using Runge-Kutta methods to achieve acceleration (Zhang et al., 2018). Although many results have been achieved, the process of obtaining the differential equation (1.2) has not been made rigorous, and the method remains time-consuming for large-scale problems. In this work, we give the precise order at which the iterates of Nesterov's accelerated method converge to the solution of the differential equation (1.2) with initial conditions

$$x(0) = x_0, \qquad \dot{x}(0) = 0 \tag{1.3}$$

as the step size $s$ goes to zero. Inspired by this perspective, we present a new accelerated method that makes this convergence faster. As expected, the iterates of the new method are closer to the solution $x(t)$ of the differential equation (1.2) than those of the original Nesterov's method. Moreover, we find the new method is more stable than the original Nesterov's method when the step size is large. Based on these observations, we take advantage of the new method in more practical problems. We apply the new method to the matrix completion problem: we combine the new method with the proximal operator (Parikh & Boyd, 2014) into a new algorithm, which we call modified FISTA. We find that the new method performs better than FISTA (Beck & Teboulle, 2009) and the accelerated proximal gradient method (Parikh & Boyd, 2014) because it can work with larger step sizes.

This paper is organized as follows. In Section 2, we prove that the iterates of Nesterov's accelerated method converge to the solution of the differential equation (1.2). In Section 3, we present a new method that makes this convergence faster and show its better stability through two simple examples. In Section 4, we apply the new method to the matrix completion problem.
2. A strict analysis of the relation between Nesterov's method and its continuous-time limit

We refer to $x(t)$ as the solution of the differential equation (1.2) with initial conditions (1.3). Existence and uniqueness of such solutions have been proved (Su et al., 2014). In this section, we give the order at which the iterates of Nesterov's accelerated method converge to $x(t)$ as the step size goes to zero. For convenience, we substitute the first equation of Nesterov's method (1.1) into the second one to get

$$x_{n+1} = x_n + \frac{n-3}{n}(x_n - x_{n-1}) - s f\Big(x_n + \frac{n-3}{n}(x_n - x_{n-1})\Big).$$

We write $s = h^2$ and rewrite the above recurrence relation as

$$x_{n+1} = x_n + \frac{n-3}{n}(x_n - x_{n-1}) - h^2 f\Big(x_n + \frac{n-3}{n}(x_n - x_{n-1})\Big). \tag{2.1}$$

Inspired by the ansatz $x_n \approx x(n\sqrt{s})$ (Su et al., 2014), we consider the convergence of $x_n$ to $x(nh)$. More precisely, we show that for fixed time $t$, $x_n$ converges to $x(t)$ as $h$ goes to zero, where $n = t/h$.

2.1. Truncation error

Firstly, we consider the following 'truncation error':

$$L[x(t); h] = x(t+h) - \frac{2t-3h}{t}x(t) + \frac{t-3h}{t}x(t-h) + h^2 f\Big(x(t) + \frac{t-3h}{t}\big(x(t) - x(t-h)\big)\Big). \tag{2.2}$$

Theorem 1. Assume $f$ satisfies the $L$-Lipschitz condition, and the solution $x(t)$ of the derived differential equation (1.2) has a continuous third derivative. Then for fixed time $t$, the truncation error (2.2) satisfies

$$L[x(t); h] = O(h^3).$$

Theorem 1 gives the size of the error caused by a single iteration when the starting point lies exactly on $x(t)$. We then add up these errors to prove the convergence property we need.

2.2. Convergence theorem

We now come to the convergence theorem, in which we give the precise order at which the iterates of Nesterov's method converge to the solution of the derived differential equation.

Theorem 2. Under the conditions of Theorem 1, for fixed time $t$, $x_{t/h}$ converges to $x(t)$ as $h$ goes to zero at a rate of $O(h \ln \frac{1}{h})$ if $x_0 = x(0)$ and $x_1 = x(h)$.

Theorem 2 is consistent with the derivation of ODE (1.2) (Su et al., 2014).

3. New accelerated method

3.1. Derivation of the new method and analysis of truncation error

Inspired by the continuous-time perspective and our proof of the convergence of the iterates of Nesterov's method to their continuous-time limit, we present a new method that makes this convergence faster. Precisely, the new method has a higher truncation order. We need one more step in our scheme than in Nesterov's method to achieve the higher truncation order in the following analysis, so we consider a recurrence relation of the form

$$\sum_{i=1}^{4}\Big(\alpha_i + \frac{\beta_i}{n} + \frac{\gamma_i}{n^2}\Big) x_{n+2-i} = -s f\Big(x_n + \frac{n-3}{n}(x_n - x_{n-1})\Big), \tag{3.1}$$

where $\{\alpha_i\}$, $\{\beta_i\}$ and $\{\gamma_i\}$ are to be determined. Now we expand $x(t-h)$ to first order. Calculation shows that

$$f\Big(x(t) + \frac{t-3h}{t}\big(x(t) - x(t-h)\big)\Big) = -h x^{(3)}(t) - \Big(\frac{3h}{t} + 1\Big) x^{(2)}(t) + \Big(\frac{3h}{t^2} - \frac{3}{t}\Big) x^{(1)}(t) + O(h^2).$$

Substituting this expansion into the truncation error

$$L[x(t); h] = \sum_{i=1}^{4}\Big(\alpha_i + \frac{\beta_i h}{t} + \frac{\gamma_i h^2}{t^2}\Big) x(t + (2-i)h) + h^2 f\Big(x(t) + \frac{t-3h}{t}\big(x(t) - x(t-h)\big)\Big),$$

and choosing the parameters appropriately to eliminate low-order terms, we get the following recurrence relation:

$$x_{n+1} = \frac{10n^2 + 9n + 6}{4n^2 + 8n} x_n - \frac{4n^2 + 3}{2n^2 + 4n} x_{n-1} + \frac{2n-1}{4n+8} x_{n-2} - \frac{n}{2n+4}\, s f\Big(\frac{2n-3}{n} x_n - \frac{n-3}{n} x_{n-1}\Big). \tag{3.2}$$

We rewrite this scheme as Algorithm 1.

Algorithm 1 The new method (3.2)
Input: step size $s$.
Initial values: $X_2 = X_1 = X_0$.
$(k-1)$th iteration ($k \ge 2$). Compute
$$Y_k = \frac{10k^2 + 9k + 6}{4k^2 + 8k} X_k - \frac{4k^2 + 3}{2k^2 + 4k} X_{k-1} + \frac{2k-1}{4k+8} X_{k-2},$$
$$Z_k = \frac{2k-3}{k} X_k - \frac{k-3}{k} X_{k-1},$$
$$X_{k+1} = Y_k - \frac{ks}{2k+4} f(Z_k).$$
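A minimal Python sketch of Algorithm 1 follows; the scalar test objective and parameter values are our own illustration, not from the paper.

```python
import numpy as np

def new_method(f_grad, x0, s, n_iters):
    """The new scheme (3.2) / Algorithm 1, keeping the last three iterates."""
    Xkm2 = Xkm1 = Xk = np.asarray(x0, dtype=float)  # X_0 = X_1 = X_2
    for k in range(2, n_iters + 2):
        Y = ((10*k*k + 9*k + 6) / (4*k*k + 8*k) * Xk
             - (4*k*k + 3) / (2*k*k + 4*k) * Xkm1
             + (2*k - 1) / (4*k + 8) * Xkm2)
        Z = (2*k - 3) / k * Xk - (k - 3) / k * Xkm1
        Xkm2, Xkm1, Xk = Xkm1, Xk, Y - k * s / (2*k + 4) * f_grad(Z)
    return Xk

# One step by hand for F(x) = x^2/2 (f(x) = x), x0 = 1, s = 0.1:
# Y_2 = 1, Z_2 = 1, X_3 = 1 - 2*0.1/8 * 1 = 0.975.
one_step = new_method(lambda x: x, 1.0, s=0.1, n_iters=1)

# A longer run drives the iterate toward the minimizer 0.
x_long = new_method(lambda x: x, 1.0, s=0.1, n_iters=5000)
```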
For the truncation order of this new method, we have the following theorem; the derivation outlined above is carried out in detail in Appendix A.4 as the proof of Theorem 3.

Theorem 3. If $f$ has a continuous second-order derivative, its first and second derivatives are bounded, and $x(t)$ has a continuous fourth derivative, then for fixed $t$, the truncation error of (3.2) satisfies $L[x(t_n); h] = O(h^4)$.

The convergence of the new method to $x(t)$ can be proved similarly to Theorem 2.

3.2. Advantage of the new method

Since the new method has a truncation error of higher order than the original Nesterov's method, the iterates of the new method can converge to the solution of the differential equation (1.2) even when those of the original Nesterov's method diverge. In other words, the new method is more stable for large step sizes. We present two numerical results in Figure 1 to confirm this.

Quadratic. $F(x) = x^T A x$ is a strongly convex function, in which $x \in \mathbb{R}^2$ and $A$ is a $2 \times 2$ matrix.

Linear regression. $F(x) = \sum_{i=1}^{n}(y_i - w_i^T x)^2$, where $n$ is the number of samples and $(w_i, y_i)$ is the $i$th sample.

In these examples, iterates of the new method converge to the minimum point, while those of the original Nesterov's method diverge, which confirms that the new method is more stable for large step sizes.
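This stability gap can be reproduced in a few lines; the 1-D quadratic, step size, and iteration counts below are our own illustrative choices, not the setup of Figure 1. With $F(x) = x^2$ we have $\nabla^2 F = 2$, so $s = 0.9$ gives $s\nabla^2 F = 1.8$, which lies outside Nesterov's absolutely stable region $[0, 4/3]$ derived in Section 3.3 but inside the new method's region $[0, 4]$.

```python
def nesterov(grad, x0, s, n_iters):
    x_prev = x = x0
    for n in range(1, n_iters + 1):
        y = x + (n - 3) / n * (x - x_prev)
        x_prev, x = x, y - s * grad(y)
    return x

def new_method(grad, x0, s, n_iters):
    Xkm2 = Xkm1 = Xk = x0  # X_0 = X_1 = X_2
    for k in range(2, n_iters + 2):
        Y = ((10*k*k + 9*k + 6) / (4*k*k + 8*k) * Xk
             - (4*k*k + 3) / (2*k*k + 4*k) * Xkm1
             + (2*k - 1) / (4*k + 8) * Xkm2)
        Z = (2*k - 3) / k * Xk - (k - 3) / k * Xkm1
        Xkm2, Xkm1, Xk = Xkm1, Xk, Y - k * s / (2*k + 4) * grad(Z)
    return Xk

grad = lambda x: 2.0 * x                              # F(x) = x^2
x_nes = nesterov(grad, 1.0, s=0.9, n_iters=80)        # blows up
x_new = new_method(grad, 1.0, s=0.9, n_iters=500)     # decays toward 0
```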

3.3. Absolute stability of Nesterov's method and the new method

In this subsection, we explain the better stability of the new method using absolute stability theory. Firstly, recall the scheme of our new method:

$$x_{n+1} = \frac{10n^2 + 9n + 6}{4n^2 + 8n} x_n - \frac{4n^2 + 3}{2n^2 + 4n} x_{n-1} + \frac{2n-1}{4n+8} x_{n-2} - \frac{n}{2n+4}\, s f\Big(\frac{2n-3}{n} x_n - \frac{n-3}{n} x_{n-1}\Big).$$

We use the linear approximation

$$f\Big(x_n + \frac{n-3}{n}(x_n - x_{n-1})\Big) = \nabla F\Big(x_n + \frac{n-3}{n}(x_n - x_{n-1})\Big) \approx \nabla^2 F \cdot \Big(x_n + \frac{n-3}{n}(x_n - x_{n-1})\Big),$$

and the characteristic equation of this finite difference scheme is approximately

$$\lambda^3 - \Big(\frac{10n^2 + 9n + 6}{4n^2 + 8n} - s\nabla^2 F \cdot \frac{2n^2 - 3n}{2n^2 + 4n}\Big)\lambda^2 + \Big(\frac{4n^2 + 3}{2n^2 + 4n} - s\nabla^2 F \cdot \frac{n^2 - 3n}{2n^2 + 4n}\Big)\lambda - \frac{2n-1}{4n+8} = 0.$$

For large $n$, we can ignore the higher-order terms, and the characteristic equation becomes

$$\lambda^3 - \Big(\frac{5}{2} - s\nabla^2 F\Big)\lambda^2 + \Big(2 - \frac{s}{2}\nabla^2 F\Big)\lambda - \frac{1}{2} = 0.$$

According to absolute stability theory, the numerical stability of a scheme with respect to accumulated roundoff error is equivalent to the following: all roots of the characteristic equation lie in the unit circle (Leader, 2004). Noticing that the left-hand side of the equation can be factorized as

$$\Big(\lambda - \frac{1}{2}\Big)\Big(\lambda^2 - (2 - s\nabla^2 F)\lambda + 1\Big),$$

the largest modulus of the roots is 1 when $0 \le s\nabla^2 F \le 4$, so the absolutely stable region of the new method is $s\nabla^2 F \in [0, 4]$. When $s\nabla^2 F$ lies in the absolutely stable region, the related theory guarantees that the error caused by each iteration will not be magnified as the iteration number increases.

To make the analysis more precise, we should consider the difference between iterations of the scheme caused by different $n$. We define the transfer matrix

$$P_n = \begin{pmatrix} \frac{10n^2+9n+6}{4n^2+8n} - s\nabla^2 F\,\frac{2n^2-3n}{2n^2+4n} & -\Big(\frac{4n^2+3}{2n^2+4n} - s\nabla^2 F\,\frac{n^2-3n}{2n^2+4n}\Big) & \frac{2n-1}{4n+8} \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$

and $Q_n = P_n P_{n-1} \cdots P_1$. Error analysis shows that if the largest modulus of the eigenvalues of $Q_n$ goes to zero, then the error caused by iterations will be eliminated as the iteration number increases.
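The boundary of the stable region can be checked numerically by computing the roots of the limiting characteristic polynomial; the sketch below is our own, with $\mu = s\nabla^2 F$ sampled at values of our choosing.

```python
import numpy as np

# The limiting characteristic polynomial of the new method is
#   lambda^3 - (5/2 - mu) lambda^2 + (2 - mu/2) lambda - 1/2 = 0,  mu = s * F''.
# All roots should stay inside or on the unit circle exactly when mu is in [0, 4].

def max_root_modulus(mu):
    # np.roots takes coefficients from the highest degree down.
    roots = np.roots([1.0, -(2.5 - mu), 2.0 - mu / 2.0, -0.5])
    return np.abs(roots).max()

inside = [max_root_modulus(mu) for mu in (0.5, 1.0, 2.0, 3.0, 4.0)]
outside = max_root_modulus(4.5)  # just past the boundary
```

For example, at $\mu = 4.5$ the quadratic factor $\lambda^2 + 2.5\lambda + 1$ has a root $\lambda = -2$, so the scheme amplifies roundoff error there.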
Figure 2 presents the largest modulus of the eigenvalues of $Q_n$ for different values of $s\nabla^2 F$. From the experiment we can see that the above condition is satisfied. We then apply the same method to Nesterov's method as discussed in (Su et al., 2014) and conclude that the absolutely stable region of Nesterov's method is $[0, \frac{4}{3}]$. According to the above analysis, the absolutely stable region of the new method is three times as large as that of Nesterov's method, so the new method is more stable, and we can choose larger step sizes to achieve faster convergence.

4. Application to matrix completion problem: modified FISTA

Our theory and numerical results show that the new method is more stable than the original Nesterov's method, so we can choose a larger step size for the new method, and convergence to the optimal solution can be faster than with the original Nesterov's method. In this section we apply the new method to the matrix completion problem. We present a new algorithm which can be viewed as a modification of the well-known fast iterative shrinkage-thresholding algorithm (FISTA) (Beck & Teboulle, 2009). The performance of modified FISTA also confirms the advantage of the new method.

In the matrix completion problem there exists a 'true' low-rank matrix $M$. We are given some entries of $M$ and asked to fill in the missing entries. Various algorithms have been proposed to solve this problem (Candès & Recht, 2009; Keshavan et al., 2010). Besides, it has been proposed that matrix completion can be transformed into the following unconstrained optimization problem (Mazumder et al., 2010):

$$\min F(X) = \frac{1}{2}\|X_{\mathrm{obs}} - M_{\mathrm{obs}}\|^2 + \lambda\|X\|_*. \tag{4.1}$$

Notice that $F(X)$ is composed of a smooth term and a non-smooth term, so gradient-based algorithms cannot be used directly. Proximal gradient algorithms (Parikh & Boyd, 2014) are widely used in such composite optimization problems, and the fast iterative shrinkage-thresholding algorithm (FISTA) is a successful example.
Moreover, FISTA has been extended to the matrix completion case (Ji & Ye, 2009). For convenience, we set $G(X) = \frac{1}{2}\|X_{\mathrm{obs}} - M_{\mathrm{obs}}\|^2$, $H(X) = \lambda\|X\|_*$, and $g(X) = \nabla G(X)$. The idea of FISTA builds on Nesterov's method. We also apply the accelerated proximal gradient method (Parikh & Boyd, 2014), which is composed of Nesterov's method and proximal gradient descent, in our numerical experiments. These two algorithms are presented in Appendix A.5; we find their performances are similar in our experiments. Our contribution is the third method (Algorithm 2), the new method (3.2) combined with the proximal operator, which we call modified FISTA.

Algorithm 2 Modified FISTA
Input: step size $s$.
Initial values: $X_2 = X_1 = X_0 \in M_{100}$.
$(k-1)$th iteration ($k \ge 2$). Compute
$$Y_k = \frac{10k^2 + 9k + 6}{4k^2 + 8k} X_k - \frac{4k^2 + 3}{2k^2 + 4k} X_{k-1} + \frac{2k-1}{4k+8} X_{k-2},$$
$$Z_k = \frac{2k-3}{k} X_k - \frac{k-3}{k} X_{k-1},$$
$$X_{k+1} = \arg\min_X \Big\{ \frac{1}{2}\cdot\frac{2k+4}{ks}\Big\|X - \Big(Y_k - \frac{ks}{2k+4} g(Z_k)\Big)\Big\|^2 + \lambda\|X\|_* \Big\}.$$

Notice that the minimization problems in the iterations of the above three algorithms can be solved directly by singular value decomposition (Cai & Candès, 2010). We run experiments on a simulated data set. We use fixed step sizes in the above three algorithms, and the performances are presented in Figure 3. We find empirically that for all methods, convergence is faster when the step size is larger, so we choose the largest step sizes for all methods to compare their fastest convergence speeds. Through experiments, we find the largest step size that keeps modified FISTA convergent is 4.1 (accurate to one decimal place), while those for the first two algorithms are both 1.3. We also compare performances of the three methods with step sizes reduced from the largest in equal proportion. We find that when step sizes are chosen to be the largest, or reduced from the largest in equal proportion (80%, 50%, 10%), convergence of modified FISTA is faster than that of the other two methods.
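The proximal step in Algorithm 2 is singular value soft-thresholding (Cai & Candès, 2010). A self-contained Python sketch of modified FISTA built on it follows; the tiny rank-1 test problem, step size, and $\lambda$ are our own illustrative choices, not the experiment of Section 4.

```python
import numpy as np

def svt(V, thresh):
    """argmin_X { ||X - V||^2 / (2*tau) + lam*||X||_* } with thresh = tau*lam:
    shrink each singular value of V by `thresh` and truncate at zero."""
    U, sig, Vt = np.linalg.svd(V, full_matrices=False)
    return U @ np.diag(np.maximum(sig - thresh, 0.0)) @ Vt

def modified_fista(M_obs, mask, s, lam, n_iters):
    g = lambda X: (X - M_obs) * mask      # gradient of 0.5*||X_obs - M_obs||^2
    X0 = X1 = X2 = np.zeros_like(M_obs)   # X_0 = X_1 = X_2 = 0
    for k in range(2, n_iters + 2):
        Y = ((10*k*k + 9*k + 6) / (4*k*k + 8*k) * X2
             - (4*k*k + 3) / (2*k*k + 4*k) * X1
             + (2*k - 1) / (4*k + 8) * X0)
        Z = (2*k - 3) / k * X2 - (k - 3) / k * X1
        tau = k * s / (2*k + 4)           # effective prox step of Algorithm 2
        X0, X1, X2 = X1, X2, svt(Y - tau * g(Z), tau * lam)
    return X2

# Tiny example: recover a rank-1 matrix from about half of its entries.
rng = np.random.default_rng(0)
M = np.outer(rng.standard_normal(20), rng.standard_normal(20))
mask = rng.random(M.shape) < 0.5
X_hat = modified_fista(M * mask, mask, s=1.0, lam=1e-3, n_iters=500)
```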
We also combine the three methods with backtracking (Beck & Teboulle, 2009) to choose step sizes automatically. We present modified FISTA with backtracking as Algorithm 3; the other two algorithms are modified similarly. Performances of the three algorithms with backtracking on the abovementioned data set are presented in Figure 4. Convergence of modified FISTA is faster than that of the other two methods. Moreover, we find that the final step size of modified FISTA is larger.
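The backtracking rule can be sketched generically as follows; for clarity we fold the factor $ks/(2k+4)$ into a single prox step $\tau$, the smooth part $G$, its gradient $g$, and the prox operator are supplied by the caller, and the function names and test problem are our own, not the paper's.

```python
import numpy as np

def backtrack(G, g, prox, Y, Z, tau0, beta):
    """Try tau = beta**i * tau0 for i = 1, 2, ... until the quadratic bound
    G(X_hat) < G(Y) + <X_hat - Y, g(Z)> + ||X_hat - Y||^2 / (2*tau) holds,
    then return the accepted point and step."""
    tau = beta * tau0
    while True:
        X_hat = prox(Y - tau * g(Z), tau)
        bound = (G(Y) + np.sum((X_hat - Y) * g(Z))
                 + np.linalg.norm(X_hat - Y) ** 2 / (2 * tau))
        if G(X_hat) < bound:
            return X_hat, tau
        tau *= beta

# With G(X) = ||X||^2/2 (gradient X, Lipschitz constant 1) and a trivial prox,
# backtracking from tau0 = 10 with beta = 0.5 accepts the first tau below 1:
# 5, 2.5, 1.25 are rejected and tau = 0.625 is accepted, giving X_hat = 0.375.
Y = Z = np.array([1.0])
X_hat, tau = backtrack(lambda X: 0.5 * np.sum(X * X), lambda X: X,
                       lambda V, t: V, Y, Z, tau0=10.0, beta=0.5)
```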

5. Discussion

In this paper we prove that the iterates of Nesterov's accelerated method converge to the solution of the derived differential equation as the step size goes to zero. We present a new accelerated method that makes this convergence faster. We use numerical results to show that the new method is more stable, especially for large step sizes, and explain this using the order of the truncation error. We then apply the new method to the matrix completion problem and present a new algorithm which we call modified FISTA. Numerical experiments show that modified FISTA performs better than existing algorithms based on Nesterov's acceleration because it can work with larger step sizes. In future work we will combine our new method with stochastic gradient-based algorithms and apply the new method to deep networks.

Algorithm 3 Modified FISTA with backtracking
Input: some $\beta < 1$.
Initial values: $X_2 = X_1 = X_0 \in M_{100}$, step size $s_2$.
$(k-1)$th iteration ($k \ge 2$). Compute
$$Y_k = \frac{10k^2 + 9k + 6}{4k^2 + 8k} X_k - \frac{4k^2 + 3}{2k^2 + 4k} X_{k-1} + \frac{2k-1}{4k+8} X_{k-2},$$
$$Z_k = \frac{2k-3}{k} X_k - \frac{k-3}{k} X_{k-1}.$$
Find the smallest positive integer $i_{k+1}$ such that with $s = \beta^{i_{k+1}} s_k$,
$$F(\hat{X}) < F(Y_k) + \big\langle \hat{X} - Y_k,\, g(Z_k) \big\rangle + \frac{1}{2}\cdot\frac{2k+4}{ks}\|\hat{X} - Y_k\|^2,$$
where
$$\hat{X} = \arg\min_X \Big\{ \frac{1}{2}\cdot\frac{2k+4}{ks}\Big\|X - \Big(Y_k - \frac{ks}{2k+4} g(Z_k)\Big)\Big\|^2 + \lambda\|X\|_* \Big\}.$$
Set $s_{k+1} = \beta^{i_{k+1}} s_k$ and compute $X_{k+1} = \hat{X}$.

Our work shows that for an accelerated gradient method, the rate at which it converges to the derived differential equation is possibly related to its properties as an optimization algorithm. We believe this work suggests that more consideration should be given to the corresponding differential equations when studying optimization algorithms.

A Appendix

A.1 Proof of Theorem 1

Theorem 1. Assume $f$ satisfies the $L$-Lipschitz condition, and the solution $x(t)$ of the derived differential equation (1.2) has a continuous third derivative. Then for fixed time $t$, the truncation error (2.2) satisfies

$$L[x(t); h] = O(h^3). \tag{A.1}$$

Proof. Notice that $x(t-h) = x(t) + O(h)$. Substituting this into the last term of $L[x(t); h]$ gives

$$f\Big(x(t) + \frac{t-3h}{t}\big(x(t) - x(t-h)\big)\Big) = f\Big(x(t) + \frac{t-3h}{t}\cdot O(h)\Big).$$

Since $f$ satisfies the $L$-Lipschitz condition, we know

$$f\Big(x(t) + \frac{t-3h}{t}\big(x(t) - x(t-h)\big)\Big) = f(x(t)) + O(h) = -\ddot{x}(t) - \frac{3}{t}\dot{x}(t) + O(h),$$

where the second equality follows by substituting the differential equation (1.2). Then we expand the first and third terms of $L[x(t); h]$ to third order:

$$x(t+h) = x(t) + h x^{(1)}(t) + \frac{h^2}{2} x^{(2)}(t) + O(h^3),$$
$$x(t-h) = x(t) - h x^{(1)}(t) + \frac{h^2}{2} x^{(2)}(t) + O(h^3).$$

Substituting these three equations into (2.2), we have $L[x(t); h] = O(h^3)$.

Remark 1. (A.1) can also be written as $|L[x(t); h]| \le M_1 h^3$, where $M_1$ depends on $\sup_{s \le t}|x^{(1)}(s)|$ and $\sup_{s \le t}|x^{(3)}(s)|$.

Remark 2. Theorem 1 deals with the problem for fixed time $t$. To finish the proof of the convergence, we have to consider the situation where $t_n = nh$ with $n \ge 1$ fixed. We set a fixed time $t_0$ and assume that $t_n = nh < t_0$. Since $x(t)$ has a continuous third derivative, $x(t)$ and its first to third derivatives are bounded on $[0, t_0]$. We replace the time $t$ in the above proof by $t_n$ and expand the terms of (2.2). Now the term $-\frac{3h^3}{2t_n} x^{(2)}(t_n)$ obtained from the expansion of $x(t_{n-1})$ cannot be viewed as $O(h^3)$, but there exists $M_2 > 0$ such that

$$\Big| -\frac{3h^3}{2t_n} x^{(2)}(t_n) \Big| \le \frac{M_2 h^2}{n}.$$

As a consequence, we have

$$|L[x(t_n); h]| \le M_1 h^3 + \frac{M_2 h^2}{n}, \tag{A.2}$$

where $M_1$ and $M_2$ depend on $t_0$.

A.2 Two lemmas for Theorem 2

For the proof of Theorem 2, we need the following two lemmas.

Lemma 1 (Holte, 2009). For constants $\alpha, \beta > 0$ and a positive sequence $\{\eta_n\}_{n \ge 0}$ satisfying

$$\eta_n \le \beta + \alpha \sum_{i=0}^{n-1} \eta_i, \qquad \forall n > 0,$$

the following inequality holds:

$$\eta_n \le e^{\alpha n}(\beta + \alpha \eta_0).$$

The above lemma is a classic result and is referred to as the discrete Gronwall inequality.

Lemma 2. Define the matrices $C_n$ and $D_{n,l}$ as

$$C_n = \begin{pmatrix} \frac{2n-1}{n+1} & -\frac{n-2}{n+1} \\ 1 & 0 \end{pmatrix}, \qquad D_{n,l} = C_n C_{n-1} \cdots C_{n-l+1},$$

where $n \ge 0$ and $0 < l \le n+1$. In addition, we set $D_{n,0} = I_2$. Then there exist positive constants $M, M_3$ such that for all $n$, the following two inequalities hold, where the matrix norm is the 2-norm:

$$\sup_{0 \le l \le n+1} \|D_{n,l}\| \le M n, \qquad \|D_{n,n+1}\| \le M_3. \tag{A.3}$$

Proof. Since $C_2 = \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}$, we notice that when $n \ge 2$,

$$D_{n,n-1} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}, \qquad D_{n,n} = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{pmatrix}, \qquad D_{n,n+1} = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix},$$

independently of the value of $n$. So it is obvious that there exists $M_3$ making (A.3) hold, and $M_4 > 0$ such that

$$\|D_{n,l}\| \le M_4 n \tag{A.4}$$

for all $n < 2$, and for $n \ge 2$ with $l > n-2$ or $l = 0$. Then we consider the case $n \ge 2$, $0 < l \le n-2$.
Notice that

$$C_n = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 0 & \frac{n-2}{n+1} \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{-1}.$$

For convenience, we write $P = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$. Assume we have already obtained

$$D_{n,l} = P \begin{pmatrix} 1 & a_{n,l} \\ 0 & b_{n,l} \end{pmatrix} P^{-1} \quad \text{with } 0 < a_{n,l} \le l, \ 0 < b_{n,l} \le 1.$$

Then, since $D_{n,l+1} = D_{n,l} C_{n-l}$ and $0 \le \frac{n-l-2}{n-l+1} < 1$, $D_{n,l+1}$ has the same form

$$D_{n,l+1} = P \begin{pmatrix} 1 & a_{n,l+1} \\ 0 & b_{n,l+1} \end{pmatrix} P^{-1}, \quad 0 < a_{n,l+1} \le l+1, \ 0 < b_{n,l+1} \le 1.$$

Then for fixed $n$, inducing from $l = 1$, we get

$$D_{n,l} = P \tilde{D}_{n,l} P^{-1} \triangleq P \begin{pmatrix} 1 & a_{n,l} \\ 0 & b_{n,l} \end{pmatrix} P^{-1}, \quad 0 < a_{n,l} \le l \le n, \ 0 < b_{n,l} \le 1, \tag{A.5}$$

for all $n \ge 2$, $0 < l \le n-2$. Then we can estimate $\|D_{n,l}\|$. Notice that

$$\tilde{D}_{n,l} \tilde{D}_{n,l}^T = \begin{pmatrix} 1 + a_{n,l}^2 & a_{n,l} b_{n,l} \\ a_{n,l} b_{n,l} & b_{n,l}^2 \end{pmatrix}.$$

The eigenvalues of this matrix are

$$\lambda_{1,2} = \frac{1 + a_{n,l}^2 + b_{n,l}^2 \pm \sqrt{(1 + a_{n,l}^2 + b_{n,l}^2)^2 - 4 b_{n,l}^2}}{2}.$$

Combining this representation with (A.5), we get the estimate

$$\|\tilde{D}_{n,l}\| = \sqrt{\max\{|\lambda_1|, |\lambda_2|\}} \le \sqrt{1 + a_{n,l}^2 + b_{n,l}^2} \le n + 2.$$

So there exists $M_5 > 0$ such that for all $n \ge 2$, $0 < l \le n-2$, the inequality

$$\|D_{n,l}\| \le M_5 n \tag{A.6}$$

holds. Combining (A.4) with (A.6) finishes the proof.

A.3 Proof of Theorem 2

Theorem 2. Under the conditions of Theorem 1, for fixed time $t$, $x_{t/h}$ converges to $x(t)$ as $h$ goes to zero at a rate of $O(h \ln \frac{1}{h})$ if $x_0 = x(0)$ and $x_1 = x(h)$.

Proof. In this proof, we first calculate the error caused by a single iteration, which can be divided into an accumulation term and a truncation term. Then we use the estimate given by Theorem 1 and apply the discrete Gronwall inequality to prove the convergence. Recall the recurrence relation

$$x_{n+1} = x_n + \frac{n-3}{n}(x_n - x_{n-1}) - h^2 f\Big(x_n + \frac{n-3}{n}(x_n - x_{n-1})\Big)$$

and the definition of the truncation error

$$x(t_{n+1}) = x(t_n) + \frac{n-3}{n}\big(x(t_n) - x(t_{n-1})\big) - h^2 f\Big(x(t_n) + \frac{n-3}{n}\big(x(t_n) - x(t_{n-1})\big)\Big) + L[x(t_n); h],$$

where $t_n = nh$.
Subtracting the above two equations and introducing the overall error $e_n = x(t_n) - x_n$, we have

$$e_{n+1} = \frac{2n-3}{n} e_n - \frac{n-3}{n} e_{n-1} - h^2 b_{n-1} + L[x(t_n); h],$$

which can also be written as

$$e_{n+2} - \frac{2n-1}{n+1} e_{n+1} + \frac{n-2}{n+1} e_n = -h^2 b_n + L[x(t_{n+1}); h], \tag{A.7}$$

where

$$b_n = f\Big(\frac{2n-1}{n+1} x_{n+1} - \frac{n-2}{n+1} x_n\Big) - f\Big(\frac{2n-1}{n+1} x(t_{n+1}) - \frac{n-2}{n+1} x(t_n)\Big). \tag{A.8}$$

Then we rewrite (A.7) in a form that is convenient for recursion. We set

$$E_n = \begin{pmatrix} e_{n+1} \\ e_n \end{pmatrix}, \qquad C_n = \begin{pmatrix} \frac{2n-1}{n+1} & -\frac{n-2}{n+1} \\ 1 & 0 \end{pmatrix}, \qquad B_n = \begin{pmatrix} -h^2 b_n + L[x(t_{n+1}); h] \\ 0 \end{pmatrix},$$

so that (A.7) can be written as $E_{n+1} = C_n E_n + B_n$. By recursion, we have

$$E_n = C_{n-1} \cdots C_0 E_0 + \sum_{l=1}^{n} C_{n-1} \cdots C_{n-l+1} B_{n-l}.$$

With the notation introduced in Lemma 2, this equation can be written as

$$E_n = D_{n-1,n} E_0 + \sum_{l=1}^{n} D_{n-1,l-1} B_{n-l}. \tag{A.9}$$

Now we need to estimate $\|B_n\|$. Since $f$ satisfies the $L$-Lipschitz condition, from (A.8) we have

$$|b_n| \le L\Big(\frac{2n-1}{n+1}|e_{n+1}| + \frac{n-2}{n+1}|e_n|\Big) \le L\big(2|e_{n+1}| + |e_n|\big) \le 3L\|E_n\|,$$

and therefore

$$\|B_n\| \le 3h^2 L \|E_n\| + |L[x(t_{n+1}); h]|. \tag{A.10}$$

Taking norms on both sides of (A.9) and substituting (A.10) and the conclusion of Lemma 2, we obtain the estimate

$$\|E_n\| \le M_3\|E_0\| + M(n-1)\sum_{l=0}^{n-1}\Big(3h^2 L\|E_l\| + |L[x(t_{l+1}); h]|\Big) \le M_3\|E_0\| + 3Mnh^2 L\sum_{l=0}^{n-1}\|E_l\| + Mn\sum_{l=0}^{n-1}|L[x(t_{l+1}); h]|. \tag{A.11}$$

Now we deal with the truncation errors. Recall (A.2) from the remark after Theorem 1:

$$|L[x(t_l); h]| \le M_1 h^3 + \frac{M_2 h^2}{l}.$$

Summing, we obtain

$$\sum_{l=0}^{n-1}|L[x(t_{l+1}); h]| \le n M_1 h^3 + M_2 h^2 \sum_{l=0}^{n-1}\frac{1}{l+1}. \tag{A.12}$$

Notice the classic inequality $\sum_{i=1}^{n}\frac{1}{i} \le \ln n + M_e$, where $M_e$ is a positive constant. Substituting this into (A.12), we have

$$\sum_{l=0}^{n-1}|L[x(t_{l+1}); h]| \le n M_1 h^3 + M_2 h^2 (\ln n + M_e).$$
Substituting this inequality into (A.11), we get a bound on $\|E_n\|$:

$$\|E_n\| \le M_3\|E_0\| + 3Mnh^2 L\sum_{l=0}^{n-1}\|E_l\| + M M_1 n^2 h^3 + M M_2 M_e n h^2 + M M_2 n h^2 \ln n.$$

Using the discrete Gronwall inequality, we have

$$\|E_n\| \le e^{3Mn^2 h^2 L}\Big(M_3\|E_0\| + M M_1 n^2 h^3 + M M_2 M_e n h^2 + M M_2 n h^2 \ln n + 3Mnh^2 L\|E_0\|\Big).$$

Then for fixed $t$, we choose $n = t/h$, so that $nh = t$ and $E_0 = 0$ by the choice of initial values; the right-hand side becomes $e^{3Mt^2 L}\big(M M_1 t^2 h + M M_2 M_e t h + M M_2 t h \ln\frac{t}{h}\big) = O\big(h \ln\frac{1}{h}\big)$, which completes the proof.
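The claims of Lemma 2 can be sanity-checked numerically; the small script below (our own, not part of the proof) verifies that the full product $D_{n,n+1}$ equals the constant matrix computed above and that the norms $\|D_{n,l}\|$ grow at most linearly in $n$.

```python
import numpy as np

def C(n):
    """The 2x2 companion-type matrix C_n from Lemma 2."""
    return np.array([[(2*n - 1) / (n + 1), -(n - 2) / (n + 1)], [1.0, 0.0]])

def D(n, l):
    """D_{n,l} = C_n C_{n-1} ... C_{n-l+1}, with D_{n,0} = I."""
    out = np.eye(2)
    for j in range(n, n - l, -1):
        out = out @ C(j)
    return out

n = 50
full_product = D(n, n + 1)  # should be [[0, 1], [0, 1]] for every n >= 2
max_norm = max(np.linalg.norm(D(n, l), 2) for l in range(n + 2))
```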



Figure 1: Iterations of the original Nesterov's method (Nesterov) and the new method (New method) for quadratic and linear regression objective functions. The y-axis represents the gap $|F(x_n) - F(x^*)|$. In Figure 1(a), the step size is $s = 0.03705$; in Figure 1(b), the step size is $s = 0.00565$.

Figure 2: The largest modulus of the eigenvalues of $Q_n$, where $s\nabla^2 F$ is chosen to be 1, 2 and 3.

Figure 3: Iterations of FISTA, accelerated proximal gradient descent (Nesterov) and modified FISTA (Modified FISTA) for the matrix completion objective function. The y-axis represents the gap $|F(x_n) - F(x^*)|$. In Figure 3(a), the step size is 1.3 for FISTA and accelerated proximal gradient descent, and 4.1 for modified FISTA. In the other three panels, step sizes are reduced from 1.3 and 4.1 in the proportions marked below the panels.

Figure 4: Iterations of FISTA, accelerated proximal gradient descent (Nesterov) and modified FISTA (Modified FISTA) for the matrix completion objective function. The y-axis represents the gap $|F(x_n) - F(x^*)|$. Step sizes are chosen by backtracking.

A.4 Proof of Theorem 3

Theorem 3. If $f$ has a continuous second-order derivative, its first and second derivatives are bounded, and $x(t)$ has a continuous fourth derivative, then for fixed $t$, the truncation error of (3.2) satisfies $L[x(t); h] = O(h^4)$.

Proof. Recall the proof of Theorem 1. Now we expand $x(t-h)$ to first order:

$$x(t-h) = x(t) - h x^{(1)}(t) + O(h^2),$$

so that

$$f\Big(x(t) + \frac{t-3h}{t}\big(x(t) - x(t-h)\big)\Big) = f(x(t)) + h x^{(1)}(t) f^{(1)}(x(t)) + O(h^2).$$

To do this, we need $f$ to have a continuous and bounded second derivative. Taking the derivative of both sides of the differential equation (1.2) gives

$$x^{(3)}(t) = \frac{3}{t^2} x^{(1)}(t) - \frac{3}{t} x^{(2)}(t) - f^{(1)}(x(t))\, x^{(1)}(t);$$

then a simple calculation shows that the terms of order less than four are eliminated if the coefficients are chosen according to a family of equations in which the free parameters $k$, $m_1$, $m_2$ can be chosen arbitrarily. Notice that the coefficients of the recurrence relation (3.2) satisfy these equations.

A.5 Algorithms

Algorithm 4 FISTA
Input: step size $s$.
Initial values: $X_1 = X_0 \in M_{100}$, $Y_1 = X_0$, $t_1 = 1$.
$k$th iteration ($k \ge 1$). Compute
$$X_k = \arg\min_X \Big\{ \frac{1}{2s}\big\|X - \big(Y_k - s\, g(Y_k)\big)\big\|^2 + \lambda\|X\|_* \Big\},$$
$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}, \qquad Y_{k+1} = X_k + \frac{t_k - 1}{t_{k+1}}(X_k - X_{k-1}).$$

Algorithm 5 Accelerated proximal gradient method
Input: step size $s$.
Initial values: $X_1 = X_0 \in M_{100}$.
$k$th iteration ($k \ge 1$). Compute
$$Y_k = X_k + \frac{k-1}{k+2}(X_k - X_{k-1}),$$
$$X_{k+1} = \arg\min_X \Big\{ \frac{1}{2s}\big\|X - \big(Y_k - s\, g(Y_k)\big)\big\|^2 + \lambda\|X\|_* \Big\}.$$

A.6 Details about Numerical Experiments in Section 4

Here we provide some details of our numerical experiments in Section 4. Our experiments are conducted on a simulated data set.

Firstly, we generate the 'true' low-rank matrix $M$. To do this, we generate a random matrix $M_0$ whose entries are independent and uniformly distributed on $(0, 20)$. Then we compute the singular value decomposition $M_0 = U \Sigma V^T$ and set $M = U \Sigma_0 V^T$, where $\Sigma_0$ is a diagonal matrix with only three nonzero diagonal elements. It is not difficult to verify that $M$ has rank 3.

Secondly, we generate the observation set. For every row of $M$, we randomly choose ten entries to be observed. As a consequence, 10% of the entries are observed in total.

After the data generation step, we apply the abovementioned algorithms (accelerated proximal gradient method, FISTA and our modified FISTA) with fixed step sizes and with backtracking to this data set. The parameter of the loss function (4.1) is $\lambda = 0.005$. For the initial point, we simply choose the zero matrix (every entry equal to zero). For backtracking, we set the initial step size to 10 and the decay factor to $\beta = 0.1$.
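The data generation procedure above can be sketched as follows; the matrix size (100 x 100) and rank 3 follow the text, while variable names and the RNG seed are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

# 'True' low-rank matrix: project a random matrix onto rank 3 via its SVD.
M0 = rng.uniform(0.0, 20.0, size=(100, 100))
U, Sigma, Vt = np.linalg.svd(M0)
Sigma0 = np.zeros_like(Sigma)
Sigma0[:3] = Sigma[:3]           # keep only three nonzero singular values
M = U @ np.diag(Sigma0) @ Vt     # rank-3 target matrix

# Observation mask: ten randomly chosen entries per row (10% observed overall).
mask = np.zeros(M.shape, dtype=bool)
for i in range(M.shape[0]):
    mask[i, rng.choice(M.shape[1], size=10, replace=False)] = True
```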

