ACCELERATION IN HYPERBOLIC AND SPHERICAL SPACES

Abstract

We advance research on the acceleration phenomenon on Riemannian manifolds by introducing the first global first-order method that achieves the same rates as accelerated gradient descent in the Euclidean space for the optimization of smooth and geodesically convex (g-convex) or strongly g-convex functions defined on the hyperbolic space or a subset of the sphere, up to constants and log factors. To the best of our knowledge, this is the first method that is proved to achieve these rates globally on functions defined on a Riemannian manifold M other than the Euclidean space. Additionally, for any Riemannian manifold of bounded sectional curvature, we provide reductions from optimization methods for smooth and g-convex functions to methods for smooth and strongly g-convex functions, and vice versa.

1. INTRODUCTION

Acceleration in convex optimization is a phenomenon that has drawn much attention and has yielded many important results since the renowned Accelerated Gradient Descent (AGD) method of Nesterov (1983). Having proved successful for deep learning Sutskever et al. (2013), among other fields, there have been recent efforts to better understand this phenomenon Allen-Zhu & Orecchia (2017); Diakonikolas & Orecchia (2019); Su et al. (2016); Wibisono et al. (2016). These have yielded numerous new results going beyond convexity or the standard oracle model, in a wide variety of settings Allen-Zhu (2017; 2018a;b); Allen-Zhu & Orecchia (2015); Allen-Zhu et al. (2016; 2017); Carmon et al. (2017); Cohen et al. (2018); Cutkosky & Sarlós (2019); Diakonikolas & Jordan (2019); Diakonikolas & Orecchia (2018); Gasnikov et al. (2019); Wang et al.

This surge of research applying tools of convex optimization to models beyond convexity has been fruitful. One of these models is the setting of geodesically convex Riemannian optimization, in which the function to optimize is geodesically convex (g-convex), i.e., convex when restricted to any geodesic (cf. Definition 1.1). Riemannian optimization, g-convex and non-g-convex alike, is an extensive area of research. In recent years there have been numerous efforts towards obtaining Riemannian optimization algorithms that share analogous properties with the more broadly studied Euclidean first-order methods: deterministic de Carvalho Bento et al. (2017); Wei et al. (2016); Zhang & Sra (2016), stochastic Hosseini & Sra (2017); Khuzani & Li (2017); Tripuraneni et al. (2018), variance-reduced Sato et al. (2017; 2019); Zhang et al. (2016), adaptive Kasai et al. (2019), saddle-point-escaping Criscitiello & Boumal (2019; 2020); Sun et al. (2019); Zhang et al. (2018); Zhou et al. (2019), and projection-free Weber & Sra (2017; 2019) methods, among others. Unsurprisingly, Riemannian optimization has found many applications in machine learning, including low-rank matrix completion Cambier & Absil (2016); Heidel & Schulz (2018); Mishra & Sepulchre (2014); Tan et al. (2014); Vandereycken (2013), dictionary learning Cherian & Sra (2017); Sun et al. (2017), optimization under orthogonality constraints Edelman et al. (1998), with applications to Recurrent Neural Networks Lezcano-Casado (2019); Lezcano-Casado & Martínez-Rubio (2019), robust covariance estimation in Gaussian distributions Wiesel (2012), Gaussian mixture models Hosseini & Sra (2015), operator scaling Allen-Zhu et al. (2018), and sparse principal component analysis Genicot et al. (2015); Huang & Wei (2019b); Jolliffe et al. (2003).

However, the acceleration phenomenon, largely celebrated in the Euclidean space, is still not well understood on Riemannian manifolds, although there has been some progress on this topic recently (cf. Related work). This poses the following question, which is the central subject of this paper: Can a Riemannian first-order method enjoy the same rates as AGD in the Euclidean space? In this work, we provide an answer in the affirmative for functions defined on hyperbolic and spherical spaces, up to constants depending on the curvature and the initial distance to an optimum, and up to log factors.
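For concreteness, the g-convexity, strong g-convexity, and smoothness conditions used throughout are the geodesic analogues of the usual Euclidean notions. The display below is our summary of these standard definitions (cf. Definition 1.1 and Zhang & Sra (2016)); here $d$ denotes the geodesic distance, $\nabla f$ the Riemannian gradient, and $\mathrm{Exp}_x^{-1}$ the inverse exponential map, defined in the Basic Geometric Definitions paragraph below.

```latex
% g-convexity: f is convex along every geodesic \gamma : [0,1] \to M,
f(\gamma(t)) \le (1-t)\, f(\gamma(0)) + t\, f(\gamma(1)), \qquad t \in [0,1].
% \mu-strong g-convexity and L-smoothness: for all x, y \in M,
f(y) \ge f(x) + \langle \nabla f(x), \mathrm{Exp}_x^{-1}(y) \rangle + \tfrac{\mu}{2}\, d(x,y)^2,
\qquad
f(y) \le f(x) + \langle \nabla f(x), \mathrm{Exp}_x^{-1}(y) \rangle + \tfrac{L}{2}\, d(x,y)^2.
```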

In particular, the main results of this work are the following.

Main Results:

• Full acceleration. We design algorithms that provably achieve the same rates of convergence as AGD in the Euclidean space, up to constants and log factors. More precisely, we obtain the rates $\widetilde{O}(\sqrt{L/\varepsilon})$ and $O^{*}(\sqrt{L/\mu}\log(\mu/\varepsilon))$ when optimizing $L$-smooth functions that are, respectively, g-convex and $\mu$-strongly g-convex, defined on the hyperbolic space or a subset of the sphere. The notation $\widetilde{O}(\cdot)$ and $O^{*}(\cdot)$ omits $\log(L/\varepsilon)$ and $\log(L/\mu)$ factors, respectively, as well as constants. Previous approaches only showed local results Zhang & Sra (2018) or obtained rates in between the ones achievable by Riemannian Gradient Descent (RGD) and AGD Ahn & Sra (2020). Moreover, these previous works only apply to functions that are smooth and strongly g-convex, and not to smooth functions that are only g-convex. As a proxy, we design an accelerated algorithm under a condition in between convexity and quasar-convexity in the constrained setting, which is of independent interest.

• Reductions. We present two reductions for any Riemannian manifold of bounded sectional curvature. Given an optimization method for smooth and g-convex functions, they provide a method for optimizing smooth and strongly g-convex functions, and vice versa. This allows one to focus on designing methods for one set of assumptions only. It is often the case that methods and key geometric inequalities that apply to manifolds of bounded sectional curvature are obtained from the ones existing for the spaces of constant extremal sectional curvature Grove et al. (1997); Zhang & Sra (2016; 2018). Consequently, our contribution is relevant not only because we establish an algorithm achieving global acceleration on functions defined on a manifold other than the Euclidean space, but also because understanding the constant sectional curvature case is an important step towards the more general case: obtaining algorithms that optimize g-convex functions, strongly g-convex or not, defined on manifolds of bounded sectional curvature.

Our main technique for designing the accelerated method consists of mapping the function domain to a subset $B$ of the Euclidean space via a geodesic map: a transformation that maps geodesics to geodesics (a concrete instance is sketched below). Given the gradient at a point $x \in M$, which defines a lower bound on the function that is linear over the tangent space at $x$, we find a lower bound on the function that is linear over $B$, despite the map being non-conformal, deforming distances, and breaking convexity. This allows us to aggregate the lower bounds easily; we believe that effective lower-bound aggregation is key to achieving Riemannian acceleration and optimality. Using this strategy, and along the lines of Diakonikolas & Orecchia (2018), we define a continuous method that we discretize using an approximate implementation of the implicit Euler method, obtaining a method that achieves the same rates as the Euclidean AGD, up to constants and log factors. Our reductions take into account the deformations produced by the geometry in order to generalize existing Euclidean reductions Allen-Zhu & Hazan (2016); Allen-Zhu & Orecchia (2017).
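To make the geodesic-map idea concrete, the following minimal sketch (ours, for illustration; the function names and the normalization are not from the paper) uses the hyperboloid model of the hyperbolic space together with the Beltrami-Klein projection, a standard geodesic map that sends geodesics of the hyperboloid to straight chords of the open unit ball.

```python
import numpy as np

# Hyperboloid model of H^n: points x in R^{n+1} with
# <x, x>_L = -x_0^2 + x_1^2 + ... + x_n^2 = -1 and x_0 > 0.

def minkowski(x, y):
    """Lorentzian (Minkowski) inner product on R^{n+1}."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def geodesic(x, v, t):
    """Unit-speed geodesic of the hyperboloid with gamma(0) = x and
    gamma'(0) = v, where v is a unit tangent vector at x
    (<x, v>_L = 0 and <v, v>_L = 1)."""
    return np.cosh(t) * x + np.sinh(t) * v

def klein(x):
    """Beltrami-Klein geodesic map: central projection of the hyperboloid
    onto the affine slice {x_0 = 1}, landing in the open unit ball.
    It maps geodesics to straight line segments."""
    return x[1:] / x[0]

# Numerical check that the image of a geodesic is a straight segment in H^2.
x = np.array([1.0, 0.0, 0.0])                    # base point of H^2
v = np.array([0.0, np.sqrt(0.5), np.sqrt(0.5)])  # unit tangent vector at x
p0 = klein(geodesic(x, v, 0.0))
p1 = klein(geodesic(x, v, 1.0))
pm = klein(geodesic(x, v, 0.5))
# The midpoint image lies on the segment [p0, p1]: the 2x2 determinant
# formed by (p1 - p0) and (pm - p0) vanishes (up to floating point).
det = (p1[0] - p0[0]) * (pm[1] - p0[1]) - (p1[1] - p0[1]) * (pm[0] - p0[0])
assert abs(det) < 1e-12
```

The projection is non-conformal and distorts distances, which is precisely the difficulty that the lower-bound aggregation described above has to overcome.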
Basic Geometric Definitions. We recall basic definitions of Riemannian geometry that we use in this work; for a thorough introduction we refer to Petersen et al. (2006). A Riemannian manifold $(M, g)$ is a real smooth manifold $M$ equipped with a metric $g$, which is a smoothly varying inner product. For $x \in M$ and any two vectors $v, w \in T_xM$ in the tangent space of $M$ at $x$, the inner product $\langle v, w \rangle_x$ is $g(v, w)$. For $v \in T_xM$, the norm is defined as usual: $\|v\|_x \overset{\mathrm{def}}{=} \sqrt{\langle v, v \rangle_x}$. Typically $x$ is known given $v$ or $w$, so we just write $\langle v, w \rangle$ or $\|v\|$ if $x$ is clear from context. A geodesic is a curve $\gamma : [0, 1] \to M$ of unit speed that is locally distance minimizing. A uniquely geodesic space is a space such that for every two points there is one and only one geodesic that joins them. In such a case, the exponential map $\mathrm{Exp}_x : T_xM \to M$ and the inverse exponential map $\mathrm{Exp}_x^{-1} : M \to T_xM$ are well defined for every pair of points, and are as follows. Given $x, y \in M$, $v \in T_xM$, and a geodesic $\gamma$ of length $\|v\|$ such that $\gamma(0) = x$ and $\gamma'(0) = v / \|v\|$, we have $\mathrm{Exp}_x(v) = \gamma(1)$, and $\mathrm{Exp}_x^{-1}(y)$ is the vector $v \in T_xM$ such that $\mathrm{Exp}_x(v) = y$.
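As a concrete instance of these two maps, here is a minimal sketch (ours; names and tolerances are illustrative, not from the paper) of the exponential map and its inverse on the unit sphere, whose geodesics are arcs of great circles. Note that the sphere is uniquely geodesic only after restricting to a suitable subset, e.g. excluding antipodal pairs, consistent with the paper working on a subset of the sphere.

```python
import numpy as np

def exp_map_sphere(x, v):
    """Exponential map on the unit sphere S^n embedded in R^{n+1}:
    follow the great circle from x with initial velocity v in T_x S^n
    (so <x, v> = 0) for arc length ||v||."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def log_map_sphere(x, y):
    """Inverse exponential map Exp_x^{-1}(y): the tangent vector at x
    pointing toward y whose norm is the geodesic distance arccos(<x, y>).
    Assumes y != -x, i.e. x and y are not antipodal."""
    c = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(c)          # geodesic distance between x and y
    if theta < 1e-12:
        return np.zeros_like(x)
    w = y - c * x                 # component of y orthogonal to x
    return theta * w / np.linalg.norm(w)

# Round trip: Exp_x(Exp_x^{-1}(y)) recovers y.
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
assert np.allclose(exp_map_sphere(x, log_map_sphere(x, y)), y)
```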


