A New Accelerated Gradient Method Inspired by Continuous-Time Perspective

Abstract

Nesterov's accelerated method is widely used in machine learning problems, including deep learning. To shed more light on the acceleration phenomenon, an ordinary differential equation was derived from Nesterov's accelerated method by taking the step size to zero, and the relationship between Nesterov's method and this differential equation remains of research interest. In this work, we give the precise order at which the iterates of Nesterov's accelerated method converge to the solution of the derived differential equation as the step size goes to zero. We then present a new accelerated method with a higher order of convergence. The new method is more stable than the original method for large step sizes and converges faster. We further apply the new method to the matrix completion problem and demonstrate its better performance through numerical experiments.

1. Introduction

Optimization is a core component of statistics and machine learning problems. Recently, gradient-based algorithms have been widely used in such optimization problems due to their simplicity and efficiency in large-scale settings. For solving the convex optimization problem min_{x∈R^d} F(x), where F(x) is convex and sufficiently smooth, a classical first-order method is gradient descent. We assume that f(x) = ∇F(x) satisfies an L-Lipschitz condition; that is, there exists a constant L such that ∥f(x) − f(y)∥ ≤ L∥x − y∥ for all x, y. Under these conditions, gradient descent achieves a convergence rate of O(n^{−1}), i.e., ∥F(x_n) − F(x^*)∥ decreases to zero at a rate of O(n^{−1}), where x_n denotes the n-th iterate and x^* denotes the minimum point of F(x) in R^d.

Nesterov's accelerated method (Nesterov, 1983) is a more efficient first-order algorithm than gradient descent, of which we will use the following form: starting with x_0 = x_1,

    y_n = x_n + ((n − 3)/n)(x_n − x_{n−1}),    x_{n+1} = y_n − s f(y_n)    (1.1)

for n ≥ 1. It is shown that under the above conditions, Nesterov's accelerated method converges at a rate of O(n^{−2}). Accelerated gradient methods have been successful in training deep and recurrent neural networks (Sutskever et al., 2013) and are widely used in machine learning problems to avoid sophisticated second-order methods (Cotter et al., 2011; Hu et al., 2009; Ji & Ye, 2009).

To provide more theoretical understanding, an important research topic concerning Nesterov's accelerated method is to explain the acceleration. On this topic, Nesterov's method was studied from a continuous-time perspective (Su et al., 2014): the authors considered a curve x(t), introduced the ansatz x_n ≈ x(n√s), and substituted it into (1.1). Letting s → 0, they obtained the following differential equation:

    ẍ + (3/t)ẋ + f(x) = 0.    (1.2)

This differential equation has been used as a tool for analyzing and generalizing Nesterov's scheme.
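To make scheme (1.1) concrete, the following minimal sketch (a toy example of ours, not one of the paper's experiments) runs Nesterov's iteration on a simple quadratic F(x) = 0.5 xᵀAx, whose gradient f(x) = Ax is L-Lipschitz with L = 10:

```python
import numpy as np

# Toy quadratic: F(x) = 0.5 * x^T A x, gradient f(x) = A x, L = 10.
A = np.diag([1.0, 10.0])
f = lambda x: A @ x                      # f = grad F
F = lambda x: 0.5 * x @ (A @ x)
s = 0.1                                  # step size s = 1/L

x_prev = x_curr = np.array([1.0, 1.0])   # starting with x_0 = x_1
for n in range(1, 500):
    y = x_curr + (n - 3) / n * (x_curr - x_prev)   # momentum step of (1.1)
    x_prev, x_curr = x_curr, y - s * f(y)          # gradient step of (1.1)
```

After a few hundred iterations, F(x_n) is driven close to the minimum value F(x^*) = 0.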
Furthermore, this idea has been pursued in several directions. A class of accelerated methods has been generated in continuous time (Wibisono et al., 2016). ODE (1.2) can also be discretized directly using Runge–Kutta methods to achieve acceleration (Zhang et al., 2018). Although many results have been achieved, the derivation of the differential equation (1.2) has not been made rigorous, and the method remains time-consuming for large-scale problems.

In this work, we give the precise order at which the iterates of Nesterov's accelerated method converge to the solution of the differential equation (1.2) with initial conditions

    x(0) = x_0,    ẋ(0) = 0    (1.3)

as the step size s goes to zero. Inspired by this perspective, we present a new accelerated method that makes this convergence faster. As expected, the iterates of the new method are closer to the solution x(t) of differential equation (1.2) than those of the original Nesterov method. Moreover, we find that the new method is more stable than the original Nesterov method when the step size is large.

Based on these observations, we seek to exploit the new method in more practical problems. We apply it to the matrix completion problem, combining it with the proximal operator (Parikh & Boyd, 2014) into a new algorithm, which we call modified FISTA. We find that the new method performs better than FISTA (Beck & Teboulle, 2009) and the accelerated proximal gradient method (Parikh & Boyd, 2014) because it can work with larger step sizes.

This paper is organized as follows. In Section 2, we prove that the iterates of Nesterov's accelerated method converge to the solution of the differential equation (1.2). In Section 3, we present a new method that makes the convergence faster and show its better stability through two simple examples. In Section 4, we apply the new method to the matrix completion problem.
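For background, the following sketch shows standard FISTA with a proximal operator applied to matrix completion (this is the baseline algorithm from Beck & Teboulle, 2009 and Parikh & Boyd, 2014, not the modified FISTA introduced later; the problem sizes and the regularization weight are our own illustrative assumptions). The objective is min_X 0.5∥P_Ω(X − M)∥_F² + λ∥X∥_*, and the proximal operator of the nuclear norm is singular-value soft-thresholding:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((20, 5)) @ rng.standard_normal((5, 20))  # rank-5 target
mask = rng.random(M.shape) < 0.6                                 # observed entries (Omega)

def svt(X, tau):
    """Prox of tau * ||.||_* : soft-threshold the singular values."""
    U, sig, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(sig - tau, 0.0)) @ Vt

lam, s = 0.1, 1.0        # gradient of the data-fit term is 1-Lipschitz here
X = Y = np.zeros_like(M)
t = 1.0
for _ in range(200):
    G = mask * (Y - M)                         # gradient of the smooth part at Y
    X_new = svt(Y - s * G, s * lam)            # proximal gradient step
    t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
    Y = X_new + (t - 1) / t_new * (X_new - X)  # FISTA momentum
    X, t = X_new, t_new

err = np.linalg.norm(mask * (X - M)) / np.linalg.norm(mask * M)
```

The relative error on the observed entries becomes small; the nuclear-norm penalty keeps the recovered matrix low-rank.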
2. A rigorous analysis of the relation between Nesterov's method and its continuous-time limit

We refer to x(t) as the solution of differential equation (1.2) with initial conditions (1.3). Existence and uniqueness of such solutions have been proved (Su et al., 2014). In this section, we give the order at which the iterates of Nesterov's accelerated method converge to x(t) as the step size goes to zero. For convenience, we substitute the first equation of Nesterov's method (1.1) into the second to get

    x_{n+1} = x_n + ((n − 3)/n)(x_n − x_{n−1}) − s · f( x_n + ((n − 3)/n)(x_n − x_{n−1}) ).

We write s = h² and rewrite the above recurrence relation as

    x_{n+1} = x_n + ((n − 3)/n)(x_n − x_{n−1}) − h² · f( x_n + ((n − 3)/n)(x_n − x_{n−1}) ).    (2.1)
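As a numerical sanity check (our own toy illustration, not part of the proof), one can run recurrence (2.1) for two step sizes and compare the iterates x_n with the ODE solution x(t) at t = n√s, for the simple choice f(x) = x, i.e. F(x) = x²/2. The ODE ẍ + (3/t)ẋ + x = 0 is integrated by RK4 on a much finer grid, using the series expansion x(t) ≈ x_0(1 − t²/8) near t = 0 to avoid the 3/t singularity:

```python
import numpy as np

def ode_solution(T, x0, dt=1e-3):
    # Integrate x'' + (3/t) x' + x = 0 with x(0) = x0, x'(0) = 0 by RK4,
    # starting from t0 > 0 via the series x(t) ~ x0 (1 - t^2/8).
    t = 10 * dt
    x, v = x0 * (1 - t**2 / 8), -x0 * t / 4
    ts, xs = [t], [x]
    rhs = lambda t, x, v: (v, -3.0 / t * v - x)
    while t < T:
        k1 = rhs(t, x, v)
        k2 = rhs(t + dt/2, x + dt/2*k1[0], v + dt/2*k1[1])
        k3 = rhs(t + dt/2, x + dt/2*k2[0], v + dt/2*k2[1])
        k4 = rhs(t + dt, x + dt*k3[0], v + dt*k3[1])
        x += dt/6 * (k1[0] + 2*k2[0] + 2*k3[0] + k4[0])
        v += dt/6 * (k1[1] + 2*k2[1] + 2*k3[1] + k4[1])
        t += dt
        ts.append(t); xs.append(x)
    return np.array(ts), np.array(xs)

def nesterov_deviation(s, T, x0):
    # Max deviation of the iterates of (2.1) from x(t) at t = (n+1) * sqrt(s).
    h = np.sqrt(s)
    ts, xs = ode_solution(T + h, x0)
    x_prev = x_curr = x0
    dev = 0.0
    for n in range(1, int(T / h)):
        y = x_curr + (n - 3) / n * (x_curr - x_prev)
        x_prev, x_curr = x_curr, y - s * y        # f(y) = y
        dev = max(dev, abs(x_curr - np.interp((n + 1) * h, ts, xs)))
    return dev

d_coarse = nesterov_deviation(s=1e-2, T=5.0, x0=1.0)
d_fine = nesterov_deviation(s=1e-4, T=5.0, x0=1.0)
```

Shrinking s brings the iterates closer to the continuous-time trajectory, which is the convergence this section quantifies.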

