FIRST-ORDER OPTIMIZATION ALGORITHMS VIA DISCRETIZATION OF FINITE-TIME CONVERGENT FLOWS

Abstract

In this paper, we analyze the performance of several discretization algorithms for two first-order finite-time optimization flows, namely the rescaled-gradient flow (RGF) and the signed-gradient flow (SGF). These flows are non-Lipschitz or discontinuous dynamical systems that converge locally in finite time to the minima of gradient-dominated functions. We introduce three discretization methods for these first-order finite-time flows and provide convergence guarantees. We then apply the proposed algorithms to the training of neural networks and empirically test their performance on three standard datasets, namely CIFAR10, SVHN, and MNIST. Our results show that our schemes converge faster than standard optimization alternatives, while achieving equivalent or better accuracy.

1. INTRODUCTION

Consider the unconstrained minimization problem for a given cost function f : ℝⁿ → ℝ. When f is sufficiently regular, the standard algorithm in continuous time (dynamical system) is the gradient flow (GF)

ẋ = F_GF(x) ≜ -∇f(x),   (1)

with ẋ ≜ (d/dt)x(t). Generalizing the GF, the q-rescaled GF (q-RGF) of Wibisono et al. (2016), given by

ẋ = -c ∇f(x) / ‖∇f(x)‖₂^((q-2)/(q-1)),   (2)

with c > 0 and q ∈ (1, ∞], has an asymptotic convergence rate f(x(t)) - f(x*) = O(1/t^(q-1)) under mild regularity, for ‖x(0) - x*‖ > 0 small enough, where x* ∈ ℝⁿ denotes a local minimizer of f. However, we recently proved in Romero & Benosman (2020) that the q-RGF, as well as our proposed q-signed GF (q-SGF)

ẋ = -c ‖∇f(x)‖₁^(1/(q-1)) sign(∇f(x)),   (3)

where sign(·) denotes the sign function, applied element-wise to (real-valued) vectors, are both finite-time convergent, provided that f is gradient dominated of order p ∈ (1, q). In particular, if f is strongly convex, then the q-RGF and q-SGF are finite-time convergent for any q ∈ (2, ∞], since f must then be gradient dominated of order p = 2.
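As an illustration of the three vector fields above, the right-hand sides of the GF (1), q-RGF (2), and q-SGF (3) can be sketched in a few lines of NumPy. This is our own minimal sketch, not code from the paper; the function names and the small `eps` guarding the division in (2) when the gradient vanishes are assumptions.

```python
import numpy as np

def gf(grad):
    """Gradient flow (1): xdot = -grad f(x)."""
    return -grad

def rgf(grad, q=3.0, c=1.0, eps=1e-12):
    """q-rescaled gradient flow (2): xdot = -c * grad / ||grad||_2^((q-2)/(q-1))."""
    norm = np.linalg.norm(grad, 2)
    return -c * grad / (norm ** ((q - 2.0) / (q - 1.0)) + eps)

def sgf(grad, q=3.0, c=1.0):
    """q-signed gradient flow (3): xdot = -c * ||grad||_1^(1/(q-1)) * sign(grad)."""
    return -c * np.linalg.norm(grad, 1) ** (1.0 / (q - 1.0)) * np.sign(grad)
```

Note that all three fields point in a descent direction; the q-RGF rescales the gradient's magnitude, while the q-SGF retains only the gradient's element-wise signs, scaled by a power of its 1-norm.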

CONTRIBUTION

In this paper, we explore three discretization schemes for the q-RGF (2) and q-SGF (3) and provide some convergence guarantees using results from hybrid dynamical control theory. In particular, we explore a forward-Euler discretization of the RGF/SGF, followed by an explicit Runge-Kutta discretization, and finally a novel Nesterov-like discretization. We then test their performance on both synthetic and real-world data in the context of deep learning, namely, over the well-known datasets CIFAR10, SVHN, and MNIST.

RELATED WORK

Propelled by the work of Wang & Elia (2011) and Su et al. (2014), there has been a recent and significant research effort dedicated to analyzing optimization algorithms from the perspective of dynamical systems and control theory, especially in continuous time Wibisono et al. (2016); Wilson (2018); Lessard et al. (2016); Fazlyab et al. (2017b); Scieur et al. (2017); França et al. (2018); Fazlyab et al. (2018); Taylor et al. (2018); França et al. (2019a); Orvieto & Lucchi (2019); Muehlebach & Jordan (2019). A major focus within this initiative is acceleration, both in terms of trying to gain new insight into more traditional optimization algorithms from this perspective, and in terms of exploiting the interplay between continuous-time systems and their potential discretizations for novel algorithm design Muehlebach & Jordan (2019); Fazlyab et al. (2017a); Shi et al. (2018); Zhang et al. (2018); França et al. (2019b); Wilson et al. (2019). Many of these papers also focus on explicit mappings and matchings of convergence rates from the continuous-time domain into discrete time. For older work connecting ordinary differential equations (ODEs) and their numerical analysis with optimization algorithms, see Botsaris (1978a;b); Zghier (1981); Snyman (1982; 1983); Brockett (1988); Brown (1989). In Helmke & Moore (1994), the authors studied relationships between linear programming, ODEs, and general matrix theory. Further, Schropp (1995) and Schropp & Singer (2000) explored several aspects linking nonlinear dynamical systems to gradient-based optimization, including nonlinear constraints. Tools from Lyapunov stability theory are often employed for this purpose, mainly because a rich body of work on them already exists within the nonlinear systems and control theory community. In particular, previous works typically seek asymptotically Lyapunov stable gradient-based systems with an equilibrium (stationary point) at an isolated extremum of the given cost function, thus certifying local convergence. Naturally, global asymptotic stability leads to global convergence, though such an analysis will typically require the cost function to be strongly convex everywhere. For physical systems, a Lyapunov function can often be constructed from first principles via some physically meaningful measure of energy (e.g., total energy = potential energy + kinetic energy).

In optimization, the situation is somewhat similar in the sense that a suitable Lyapunov function may often be constructed by taking simple surrogates of the objective function as candidates. For instance, V(x) ≜ f(x) - f(x*) can be a good initial candidate. Further, if f is continuously differentiable and x* is an isolated stationary point, then another alternative is V(x) ≜ ‖∇f(x)‖². However, most fundamental and applied research conducted in systems and control regarding Lyapunov stability theory deals exclusively with continuous-time systems. Unfortunately, (dynamical) stability properties are generally not preserved under simple forward-Euler and sample-and-hold discretizations and control laws Stuart & Humphries (1998). Furthermore, practical implementations of optimization algorithms in modern digital computers demand discrete time. Nonetheless, it has been extensively noted that a vast number of general Lyapunov-based results appear to have discrete-time equivalents. In that sense, we aim here to start from the q-RGF and q-SGF continuous flows, characterized by their Lyapunov-based finite-time convergence, and seek discretization schemes that allow us to 'shadow' the solutions of these flows in discrete time, hoping to achieve an acceleration of the discrete methods inspired by the finite-time convergence characteristics of the continuous flows.

2. OPTIMIZATION ALGORITHMS AS DISCRETE-TIME SYSTEMS

Generalizing (1), (2), and (3), consider a continuous-time algorithm (dynamical system) modeled via an ordinary differential equation (ODE)

ẋ = F(x)   (4)

for t ≥ 0, or, more generally, a differential inclusion

ẋ(t) ∈ F(x(t))   (5)

for a.e. t ≥ 0 (e.g., for the q = ∞ case), such that x(t) → x* as t → t*. In the case of the q-RGF (2) and q-SGF (3) with f gradient dominated of order p ∈ (1, q), we have finite-time convergence, and thus t* = t*(x(0)) < ∞. Most of the popular numerical optimization schemes can be written in a state-space form (i.e., recursively), as

X_{k+1} = F_d(k, X_k),   (6a)
x_k = G(X_k).   (6b)
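To make the connection between the flows and the recursion (6) concrete, the simplest instance of (6) is the forward-Euler discretization x_{k+1} = x_k + h·F(x_k) with step size h. The sketch below applies it to the q-RGF (2) on a strongly convex quadratic; it is our own illustration under stated assumptions (the gradient oracle `grad_f`, the step size, the stopping tolerance, and the default q = 3 are all choices we made, not the paper's algorithm).

```python
import numpy as np

def forward_euler_rgf(grad_f, x0, h=0.05, q=3.0, c=1.0, steps=200, tol=1e-12):
    """Forward-Euler discretization of the q-RGF (2):
    x_{k+1} = x_k - h * c * grad_f(x_k) / ||grad_f(x_k)||_2^((q-2)/(q-1))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        n = np.linalg.norm(g, 2)
        if n < tol:  # numerically at a stationary point
            break
        x = x - h * c * g / n ** ((q - 2.0) / (q - 1.0))
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x and
# whose unique minimizer is the origin.
x_final = forward_euler_rgf(lambda x: x, x0=[2.0, -1.0])
```

With a fixed step size, the iterates reach a small neighborhood of the minimizer and then chatter around it (the rescaled field does not vanish smoothly at x*), which is precisely why the step-size selection and the alternative discretizations studied in this paper matter.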

