A New Accelerated Gradient Method Inspired by Continuous-Time Perspective

Abstract

Nesterov's accelerated method is widely used in machine learning problems, including deep learning. To give more insight into the acceleration phenomenon, an ordinary differential equation was derived from Nesterov's accelerated method by taking step sizes approaching zero, and the relationship between Nesterov's method and this differential equation remains of research interest. In this work, we establish the precise order at which the iterates of Nesterov's accelerated method converge to the solution of the derived differential equation as the step size goes to zero. We then present a new accelerated method with a higher order of convergence. The new method is more stable than the ordinary method for large step sizes and converges faster. We further apply the new method to the matrix completion problem and demonstrate its better performance through numerical experiments.

1. Introduction

Optimization is a core component of statistics and machine learning problems. Recently, gradient-based algorithms have been widely used in such optimization problems due to their simplicity and efficiency in large-scale settings. For solving the convex optimization problem $\min_{x \in \mathbb{R}^d} F(x)$, where $F(x)$ is convex and sufficiently smooth, a classical first-order method is gradient descent. We assume that $f(x) = \nabla F(x)$ satisfies an $L$-Lipschitz condition, that is, there exists a constant $L$ such that $\|f(x) - f(y)\| \le L \|x - y\|$ for all $x, y$. Under these conditions, gradient descent achieves a convergence rate of $O(n^{-1})$, i.e., $\|F(x_n) - F(x^*)\|$ decreases to zero at a rate of $O(n^{-1})$, where $x_n$ denotes the $n$th iterate and $x^*$ denotes the minimum point of $F(x)$ in $\mathbb{R}^d$.

Nesterov's accelerated method (Nesterov, 1983) is a more efficient first-order algorithm than gradient descent, of which we will use the following form: starting with $x_0 = x_1$,
$$y_n = x_n + \frac{n-3}{n}(x_n - x_{n-1}), \qquad x_{n+1} = y_n - s f(y_n) \tag{1.1}$$
for $n \ge 1$, where $s > 0$ is the step size. It is shown that under the abovementioned conditions, Nesterov's accelerated method converges at a rate of $O(n^{-2})$. The accelerated gradient method has been successful in training deep and recurrent neural networks (Sutskever et al., 2013) and is widely used in problems with a machine learning background to avoid
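To make the two schemes concrete, the following is a minimal sketch of the iteration (1.1) next to plain gradient descent, on a hypothetical ill-conditioned quadratic test problem (the matrix, dimensions, and iteration count below are illustrative choices, not from the paper):

```python
import numpy as np

# Hypothetical test problem: F(x) = 0.5 x^T A x, with gradient f(x) = A x.
# f is L-Lipschitz with L = largest eigenvalue of A.
A = np.diag([1.0, 1000.0])

def f(x):               # gradient of F
    return A @ x

def F(x):               # objective value
    return 0.5 * x @ (A @ x)

L = 1000.0
s = 1.0 / L             # step size s <= 1/L

def nesterov(x0, n_iters):
    """Nesterov's scheme (1.1): starting with x_0 = x_1, for n >= 1,
    y_n = x_n + (n-3)/n * (x_n - x_{n-1}),  x_{n+1} = y_n - s f(y_n)."""
    x_prev, x = x0.copy(), x0.copy()
    for n in range(1, n_iters + 1):
        y = x + (n - 3) / n * (x - x_prev)   # momentum coefficient (n-3)/n
        x_prev, x = x, y - s * f(y)
    return x

def gradient_descent(x0, n_iters):
    x = x0.copy()
    for _ in range(n_iters):
        x = x - s * f(x)
    return x

x0 = np.array([1.0, 1.0])
print(F(nesterov(x0, 100)), F(gradient_descent(x0, 100)))
```

On such a badly conditioned problem, the fixed step size $s = 1/L$ forces gradient descent to crawl along the flat direction, while the growing momentum term in (1.1) lets the accelerated iterates make much faster progress over the same iteration budget.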

