CONTINUIZED ACCELERATION FOR QUASAR CONVEX FUNCTIONS IN NON-CONVEX OPTIMIZATION

Abstract

Quasar convexity is a condition that allows some first-order methods to efficiently minimize a function even when the optimization landscape is non-convex. Previous works develop near-optimal accelerated algorithms for minimizing this class of functions; however, they require a binary-search subroutine that results in multiple gradient evaluations in each iteration, and consequently the total number of gradient evaluations does not match a known lower bound. In this work, we show that a recently proposed continuized Nesterov acceleration can be applied to minimizing quasar convex functions and achieves the optimal bound with high probability. Furthermore, we find that the objective functions of training generalized linear models (GLMs) satisfy quasar convexity, which broadens the applicability of the relevant algorithms; known practical examples of quasar convexity in non-convex learning were previously sparse in the literature. We also show that if a smooth and one-point strongly convex, Polyak-Łojasiewicz, or quadratic-growth function satisfies quasar convexity, then an accelerated linear rate for minimizing the function is attainable under certain conditions, while acceleration is not known in general for these classes of functions.

1. INTRODUCTION

Momentum has been the main workhorse for training machine learning models (Kingma & Ba, 2015; Wilson et al., 2017; Loshchilov & Hutter, 2019; Reddi et al., 2018; He et al., 2016; Simonyan & Zisserman, 2015; Krizhevsky et al., 2012). In convex learning and optimization, several momentum methods have been developed under different machineries, including ones built on Nesterov's estimate sequence (Nesterov, 1983; 2013), methods derived from ordinary differential equations and continuous-time techniques (Krichene et al., 2015; Scieur et al., 2017; Attouch et al., 2018; Su et al., 2014; Wibisono et al., 2016; Shi et al., 2018; Diakonikolas & Orecchia, 2019), approaches based on dynamical systems and control (Hu & Lessard, 2017; Wilson et al., 2021), algorithms generated from playing a two-player zero-sum game via no-regret learning strategies (Wang et al., 2021a; Wang & Abernethy, 2018; Cohen et al., 2021), and a recently introduced continuized acceleration (Even et al., 2021). On the other hand, in the non-convex world, although ample empirical evidence confirms that momentum methods converge faster than gradient descent (GD) in several applications, see e.g., Sutskever et al. (2013); Leclerc & Madry (2020), first-order accelerated methods that provably find a global optimal point are sparse in the literature. Indeed, we are aware of only a few results showing acceleration over GD. Wang et al. (2021b) show that Heavy Ball has an accelerated linear rate for training an over-parametrized ReLU network and a deep linear network, where the accelerated linear rate has a square-root dependency on the condition number of a neural tangent kernel matrix at initialization, while the linear rate of GD depends linearly on the condition number. A follow-up work of Wang et al. (2022) shows that Heavy Ball achieves acceleration for minimizing a class of Polyak-Łojasiewicz functions (Polyak, 1963).
When the goal is not finding a global optimal point but a first-order stationary point, some benefits of incorporating the dynamics of momentum can be shown (Cutkosky & Orabona, 2019; Cutkosky & Mehta, 2021; Levy et al., 2021). Nevertheless, theoretically grounded momentum methods in non-convex optimization remain underexplored to our knowledge. With the goal of advancing the progress of momentum methods in non-convex optimization, we study efficiently solving min_w f(w), where the function f(·) satisfies quasar convexity (Hinder et al., 2020; Hardt et al., 2018; Nesterov et al., 2019; Guminov & Gasnikov, 2017; Bu & Mesbahi, 2020), which is defined in the following. Under quasar convexity, it can be shown that GD or certain momentum methods can globally minimize a function even when the optimization landscape is non-convex.

Definition 1. (Quasar convexity) Let ρ > 0. Denote w* a global minimizer of f(·) : R^d → R. The function f(·) is ρ-quasar convex if for all w ∈ R^d, one has:
f(w*) ≥ f(w) + (1/ρ)⟨∇f(w), w* − w⟩.
For µ > 0, the function f(·) is (ρ, µ)-strongly quasar convex if for all w ∈ R^d, one has:
f(w*) ≥ f(w) + (1/ρ)⟨∇f(w), w* − w⟩ + (µ/2)‖w* − w‖².

For more characterizations of quasar convexity, we refer the reader to Hinder et al. (2020) (Appendix D in the paper), where a thorough discussion is provided.

Recall that a function f(·) is L-smooth if f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖² for any x and y, where L > 0 is the smoothness constant. For minimizing L-smooth and ρ-quasar convex functions, the algorithm of Hinder et al. (2020) takes O(L^{1/2}‖w0 − w*‖ / (ρ ε^{1/2})) iterations, and its number of gradient evaluations for getting an ε-optimality gap carries an additional logarithmic factor on top of this bound. For L-smooth and (ρ, µ)-strongly quasar convex functions, the algorithm of Hinder et al. (2020) takes O((√(L/µ)/ρ) log(V/ε)) iterations, where

V := f(w0) − f(w*) + (µ/2)‖z0 − w*‖²,

and w0 and z0 are some initial points; again, the number of gradient evaluations carries an additional logarithmic factor. Both results of Hinder et al. (2020) improve those in the previous works of Nesterov et al. (2019) and Guminov & Gasnikov (2017) for minimizing quasar and strongly quasar convex functions. A lower bound Ω(L^{1/2}‖w0 − w*‖ / (ρ ε^{1/2})) on the number of gradient evaluations for minimizing quasar convex functions via any first-order deterministic method is also established in Hinder et al. (2020). The additional logarithmic factors in the (upper bounds of the) number of gradient evaluations, compared to the iteration complexity, result from a binary-search subroutine that is executed in each iteration to determine the value of a specific parameter of the algorithm.

A similar concern applies to Bu & Mesbahi (2020), where the algorithm assumes an oracle is available but its implementation needs a subroutine that demands multiple function and gradient evaluations in each iteration. Hence, the open questions are whether the additional logarithmic factors in the total number of gradient evaluations can be removed and whether function evaluations are necessary for an accelerated method to minimize quasar convex functions.

We answer them by showing an accelerated randomized algorithm that avoids the subroutine, makes only one gradient call per iteration, and does not need function evaluations. Consequently, the gradient-call complexity does not incur the additional logarithmic factors of the previous works, and, perhaps more importantly, the computational cost per iteration is significantly reduced. The proposed algorithms are built on the continuized discretization technique recently introduced to the optimization community by Even et al. (2021), which offers a clean way to implement a continuous-time dynamic as a discrete-time algorithm. Specifically, the technique allows one to use differential calculus to design and analyze an algorithm in continuous time, while the discretization of the continuized process does not suffer any discretization error, thanks to the fact that the Poisson process can be simulated exactly. Our acceleration results in this paper champion the approach and provably showcase the advantage of momentum over GD for minimizing quasar convex functions.

While previous works on quasar convexity are theoretically interesting, a lingering issue is that few examples are known in non-convex machine learning. While some synthetic functions are shown in previous works (Hinder et al., 2020; Nesterov et al., 2019; Guminov & Gasnikov, 2017), the only practical non-convex learning applications that we are aware of are given by Hardt et al. (2018), who show that for learning a class of linear dynamical systems, a relevant objective function over a convex constraint set satisfies quasar convexity, and by Foster et al. (2018), who show that a robust linear regression with Tukey's biweight loss and a GLM with an increasing link function satisfy quasar convexity, under the assumption that the link function has a bounded
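As a concrete illustration of Definition 1, the following self-contained sketch numerically checks the quasar-convexity inequality on a finite grid of points. The helper `is_quasar_convex`, the sample grid, and the example f(x) = √|x| (a non-convex function that is nonetheless 1/2-quasar convex about its minimizer w* = 0) are our own illustrative choices, not constructions from the paper.

```python
import math

def is_quasar_convex(f, grad, w_star, rho, points, tol=1e-9):
    """Numerically check the inequality of Definition 1 on sampled points:
    f(w*) >= f(w) + (1/rho) * <grad f(w), w* - w> for every sampled w."""
    f_star = f(w_star)
    return all(
        f_star >= f(w) + (1.0 / rho) * grad(w) * (w_star - w) - tol
        for w in points
    )

# f(x) = sqrt(|x|) is non-convex (concave on each half-line) yet satisfies
# the quasar-convexity inequality with rho = 1/2 about its minimizer 0.
f = lambda x: math.sqrt(abs(x))
grad = lambda x: math.copysign(1.0, x) / (2.0 * math.sqrt(abs(x)))

pts = [x / 10.0 for x in range(-50, 51) if x != 0]  # grid avoiding the kink at 0
print(is_quasar_convex(f, grad, 0.0, 0.5, pts))  # True
print(is_quasar_convex(f, grad, 0.0, 1.0, pts))  # False: rho = 1 is too large here
```

A short calculation explains both outputs: for w ≠ 0 the right-hand side equals √|w|·(1/(2ρ) − 1) + f(w*), which is at most f(w*) = 0 exactly when ρ ≤ 1/2.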


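To make the continuized discretization concrete, here is a minimal sketch on a toy smooth and strongly convex quadratic. The mixing rate √(µ/L) and the step sizes 1/L and 1/√(µL) follow the usual strongly convex Nesterov scaling and are assumptions for illustration, not the exact parameter choices of Even et al. (2021). The key point the sketch reflects: between gradient steps, the coupled linear ODE on (x, z) is integrated in closed form, and the gradient steps occur at the arrival times of a rate-1 Poisson process (Exp(1) waiting times), which can be simulated exactly, so the discrete implementation incurs no discretization error.

```python
import math
import random

def continuized_nesterov(grad, w0, L, mu, iters, seed=0):
    """Sketch of a continuized accelerated iteration: mix x and z by the
    closed-form solution of dx = r(z - x) dt, dz = r(x - z) dt over a
    random Exp(1) waiting time, then take gradient steps at the jump."""
    rng = random.Random(seed)
    r = math.sqrt(mu / L)  # assumed Nesterov-style mixing rate
    x, z = list(w0), list(w0)
    for _ in range(iters):
        tau = rng.expovariate(1.0)  # exact Poisson inter-arrival time
        decay = 1.0 - math.exp(-2.0 * r * tau)
        # Exact integration of the mixing ODE over [0, tau] (no discretization error).
        x, z = (
            [xi + 0.5 * decay * (zi - xi) for xi, zi in zip(x, z)],
            [zi + 0.5 * decay * (xi - zi) for xi, zi in zip(x, z)],
        )
        # Gradient jump: a 1/L step on x and a larger 1/sqrt(mu*L) step on z.
        g = grad(x)
        x = [xi - gi / L for xi, gi in zip(x, g)]
        z = [zi - gi / math.sqrt(mu * L) for zi, gi in zip(z, g)]
    return x

# Toy strongly convex quadratic f(w) = 0.5*(w1^2 + 10*w2^2), so mu = 1, L = 10.
f = lambda w: 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)
grad = lambda w: [w[0], 10.0 * w[1]]
w = continuized_nesterov(grad, [1.0, 1.0], L=10.0, mu=1.0, iters=300)
print(f(w))  # far below the initial value f([1, 1]) = 5.5
```

Note that one iteration makes exactly one gradient call and no function evaluations, which mirrors the per-iteration cost claimed for the proposed algorithm.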