ACCELERATION IN HYPERBOLIC AND SPHERICAL SPACES

Abstract

We further the research on the acceleration phenomenon on Riemannian manifolds by introducing the first global first-order method that achieves the same rates as accelerated gradient descent in the Euclidean space for the optimization of smooth and geodesically convex (g-convex) or strongly g-convex functions defined on the hyperbolic space or a subset of the sphere, up to constants and log factors. To the best of our knowledge, this is the first method that is proved to achieve these rates globally on functions defined on a Riemannian manifold $\mathcal{M}$ other than the Euclidean space. Additionally, for any Riemannian manifold of bounded sectional curvature, we provide reductions from optimization methods for smooth and g-convex functions to methods for smooth and strongly g-convex functions and vice versa.

1. INTRODUCTION

Acceleration in convex optimization is a phenomenon that has drawn a lot of attention and has yielded many important results since the renowned Accelerated Gradient Descent (AGD) method of Nesterov (1983). Since it has proved successful for deep learning Sutskever et al. (2013), among other fields, there have been recent efforts to better understand this phenomenon Allen Zhu & Orecchia (2017); Diakonikolas & Orecchia (2019); Su et al. (2016); Wibisono et al. (2016). These have yielded numerous new results going beyond convexity or the standard oracle model, in a wide variety of settings Allen-Zhu (2017;



However, the acceleration phenomenon, largely celebrated in the Euclidean space, is still not understood in Riemannian manifolds, although there has been some progress on this topic recently (cf. Related work). This poses the following question, which is the central subject of this paper: Can a Riemannian first-order method enjoy the same rates as AGD in the Euclidean space? In this work, we provide an answer in the affirmative for functions defined on hyperbolic and spherical spaces, up to constants depending on the curvature and the initial distance to an optimum, and up to log factors. In particular, the main results of this work are the following.

Main Results:

• Full acceleration. We design algorithms that provably achieve the same rates of convergence as AGD in the Euclidean space, up to constants and log factors. More precisely, we obtain the rates $\widetilde{O}(L/\sqrt{\varepsilon})$ and $O^*(\sqrt{L/\mu}\,\log(\mu/\varepsilon))$ when optimizing $L$-smooth functions that are, respectively, g-convex and $\mu$-strongly g-convex, defined on the hyperbolic space or a subset of the sphere. The notations $\widetilde{O}(\cdot)$ and $O^*(\cdot)$ omit $\log(L/\varepsilon)$ and $\log(L/\mu)$ factors, respectively, and constants. Previous approaches only showed local results Zhang & Sra (2018) or obtained results with rates in between the ones obtainable by Riemannian Gradient Descent (RGD) and AGD Ahn & Sra (2020). Moreover, these previous works only apply to functions that are smooth and strongly g-convex and not to smooth functions that are only g-convex. As a proxy, we design an accelerated algorithm under a condition between convexity and quasar-convexity in the constrained setting, which is of independent interest.

• Reductions. We present two reductions for any Riemannian manifold of bounded sectional curvature. Given an optimization method for smooth and g-convex functions, they provide a method for optimizing smooth and strongly g-convex functions, and vice versa. This allows one to focus on designing methods for one set of assumptions only.

It is often the case that methods and key geometric inequalities that apply to manifolds with bounded sectional curvatures are obtained from the ones existing for the spaces of constant extremal sectional curvature Grove et al. (1997); Zhang & Sra (2016; 2018).
Consequently, our contribution is relevant not only because we establish an algorithm achieving global acceleration on functions defined on a manifold other than the Euclidean space, but also because understanding the constant sectional curvature case is an important step towards understanding the more general case of obtaining algorithms that optimize g-convex functions, strongly or not, defined on manifolds of bounded sectional curvature.

Our main technique for designing the accelerated method consists of mapping the function domain to a subset $\mathcal{B}$ of the Euclidean space via a geodesic map: a transformation that maps geodesics to geodesics. Given the gradient at a point $x \in \mathcal{M}$, which defines a lower bound on the function that is linear over the tangent space of $x$, we find a lower bound of the function that is linear over $\mathcal{B}$, despite the map being non-conformal, deforming distances, and breaking convexity. This allows us to aggregate the lower bounds easily. We believe that effective lower bound aggregation is key to achieving Riemannian acceleration and optimality. Using this strategy, we are able to provide an algorithm along the lines of the one in Diakonikolas & Orecchia (2018): we define a continuous method that we discretize using an approximate implementation of the implicit Euler method, obtaining a method that achieves the same rates as the Euclidean AGD, up to constants and log factors. Our reductions take into account the deformations produced by the geometry to generalize existing Euclidean reductions Allen Zhu & Hazan (2016); Allen Zhu & Orecchia (2017).

Basic Geometric Definitions. We recall basic definitions of Riemannian geometry that we use in this work. For a thorough introduction we refer to Petersen et al. (2006). A Riemannian manifold $(\mathcal{M}, g)$ is a real smooth manifold $\mathcal{M}$ equipped with a metric $g$, which is a smoothly varying inner product. For $x \in \mathcal{M}$ and any two vectors $v, w \in T_x\mathcal{M}$ in the tangent space of $\mathcal{M}$ at $x$, the inner product $\langle v, w\rangle_x$ is $g(v, w)$.
For $v \in T_x\mathcal{M}$, the norm is defined as usual, $\|v\|_x \overset{\mathrm{def}}{=} \sqrt{\langle v, v\rangle_x}$. Typically, $x$ is known given $v$ or $w$, so we will just write $\langle v, w\rangle$ or $\|v\|$ if $x$ is clear from context. A geodesic is a curve $\gamma : [0, 1] \to \mathcal{M}$ of unit speed that is locally distance minimizing. A uniquely geodesic space is a space such that for every two points there is one and only one geodesic that joins them. In such a case the exponential map $\operatorname{Exp}_x : T_x\mathcal{M} \to \mathcal{M}$ and inverse exponential map $\operatorname{Exp}_x^{-1} : \mathcal{M} \to T_x\mathcal{M}$ are well defined for every pair of points, and are as follows. Given $x, y \in \mathcal{M}$, $v \in T_x\mathcal{M}$, and a geodesic $\gamma$ of length $\|v\|$ such that $\gamma(0) = x$, $\gamma(1) = y$, $\gamma'(0) = v/\|v\|$, we have that $\operatorname{Exp}_x(v) = y$ and $\operatorname{Exp}_x^{-1}(y) = v$. Note, however, that $\operatorname{Exp}_x(\cdot)$ might not be defined for each $v \in T_x\mathcal{M}$. We denote by $d(x, y)$ the distance between $x$ and $y$. Its value is the same as $\|\operatorname{Exp}_x^{-1}(y)\|$. Given a 2-dimensional subspace $V \subseteq T_x\mathcal{M}$, the sectional curvature at $x$ with respect to $V$ is defined as the Gauss curvature of the manifold $\operatorname{Exp}_x(V)$ at $x$.

Notation. Let $\mathcal{M}$ be a manifold and let $\mathcal{B} \subseteq \mathbb{R}^d$. We denote by $h : \mathcal{M} \to \mathcal{B}$ a geodesic map Kreyszig (1991), which is a diffeomorphism such that the image and the inverse image of a geodesic is a geodesic. Usually, given an initial point $x_0$ of our algorithm, we will have $h(x_0) = 0$. Given a point $x \in \mathcal{M}$ we use the notation $\tilde{x} = h(x)$ and vice versa; any point in $\mathcal{B}$ will use a tilde. Given two points $x, y \in \mathcal{M}$ and a vector $v \in T_x\mathcal{M}$ in the tangent space of $x$, we use the formal notation $\langle v, y - x\rangle \overset{\mathrm{def}}{=} -\langle v, x - y\rangle \overset{\mathrm{def}}{=} \langle v, \operatorname{Exp}_x^{-1}(y)\rangle$. Given a vector $v \in T_x\mathcal{M}$, we call $\tilde{v} \in \mathbb{R}^d$ the vector of the same norm such that $\{\tilde{x} + \tilde{\lambda}\tilde{v} \mid \tilde{\lambda} \in \mathbb{R}^+,\ \tilde{x} + \tilde{\lambda}\tilde{v} \in \mathcal{B}\} = \{h(\operatorname{Exp}_x(\lambda v)) \mid \lambda \in I \subseteq \mathbb{R}^+\}$, for some interval $I$. Likewise, given $x$ and a vector $\tilde{v} \in \mathbb{R}^d$, we define $v \in T_x\mathcal{M}$. Let $x^*$ be any minimizer of $F : \mathcal{M} \to \mathbb{R}$. We denote by $R \ge d(x_0, x^*)$ a bound on the distance between $x^*$ and the initial point $x_0$. Note that this implies that $x^* \in \operatorname{Exp}_{x_0}(\bar{B}(0, R))$, for the closed ball $\bar{B}(0, R) \subseteq T_{x_0}\mathcal{M}$.
Consequently, we will work with the manifold that is a subset of a $d$-dimensional complete and simply connected manifold of constant sectional curvature $K$, namely a subset of the hyperbolic space or sphere Petersen et al. (2006), defined as $\operatorname{Exp}_{x_0}(\bar{B}(0, R))$, with the inherited metric. Denote by $\mathcal{H}$ this manifold in the former case and $\mathcal{S}$ in the latter, and note that we are not making explicit the dependence on $d$, $R$ and $K$. We want to work with the standard choice of uniquely geodesic manifolds Ahn & Sra (2020); Liu et al. (2017); Zhang & Sra (2016; 2018). Therefore, in the case that the manifold is $\mathcal{S}$, we restrict ourselves to $R < \pi/(2\sqrt{K})$, so $\mathcal{S}$ is contained in an open hemisphere. The big-O notations $\widetilde{O}(\cdot)$ and $O^*(\cdot)$ omit $\log(L/\varepsilon)$ and $\log(L/\mu)$ factors, respectively, and constant factors depending on $R$ and $K$. We define now the main properties that will be assumed on the function $F$ to be minimized.

Definition 1.1 (Geodesic Convexity and Smoothness). Let $F : \mathcal{M} \to \mathbb{R}$ be a differentiable function defined on a Riemannian manifold $(\mathcal{M}, g)$. Given $L \ge \mu > 0$, we say that $F$ is $L$-smooth, and respectively $\mu$-strongly g-convex, if for any two points $x, y \in \mathcal{M}$, $F$ satisfies

$$F(y) \le F(x) + \langle \nabla F(x), y - x\rangle + \frac{L}{2}\, d(x, y)^2, \qquad \text{resp.} \qquad F(y) \ge F(x) + \langle \nabla F(x), y - x\rangle + \frac{\mu}{2}\, d(x, y)^2.$$

We say $F$ is g-convex if the second inequality above, i.e. $\mu$-strong g-convexity, is satisfied with $\mu = 0$. Note that we have used the formal notation above for the subtraction of points in the inner product.

Comparison with Related Work. There are a number of works that study the problem of first-order acceleration in Riemannian manifolds of bounded sectional curvature. The first study is Liu et al. (2017). In this work, the authors develop an accelerated method with the same rates as AGD for both g-convex and strongly g-convex functions, provided that at each step a given nonlinear equation can be solved. No algorithm for solving this equation has been found and, in principle, it could be intractable or infeasible.
In Alimisis et al. (2019) a continuous method analogous to the continuous approach to accelerated methods is presented, but it is not known if there exists an accelerated discretization of it. In Alimisis et al. (2020), the algorithm presented is claimed to enjoy an accelerated rate of convergence, but it fails to provide convergence when the function value gets below a potentially large constant that depends on the manifold and the smoothness constant. In Huang & Wei (2019a) an accelerated algorithm is presented, but it relies on strong geometric inequalities that are not proved to be satisfied. Zhang & Sra (2018) obtain a local algorithm that optimizes $L$-smooth and $\mu$-strongly g-convex functions achieving the same rates as AGD in the Euclidean space, up to constants. That is, the initial point needs to start close to the optimum, $O((\mu/L)^{3/4})$ close, to be precise. Their approach consists of adapting Nesterov's estimate sequence technique by keeping a quadratic on $T_{x_t}\mathcal{M}$ that induces on $\mathcal{M}$ a regularized lower bound on $F(x^*)$ via $\operatorname{Exp}_{x_t}(\cdot)$. They aggregate the information yielded by the gradient to it, and use a geometric lemma to find a quadratic in $T_{x_{t+1}}\mathcal{M}$ whose induced function lower bounds the other one. Ahn & Sra (2020) generalize the previous algorithm and, by using similar ideas for the lower bound, they adapt it to work globally, obtaining strictly better rates than RGD and recovering the local acceleration of the previous paper, but not achieving global rates comparable to the ones of AGD. In fact, they prove that their algorithm eventually decreases the function value at a rate close to AGD, but this can take as many iterations as the ones needed by RGD to minimize the function. In our work, we take a step back and focus on the constant sectional curvature case to provide a global algorithm that achieves the same rates as AGD, up to constants and log factors.
It is common to characterize the properties of spaces of bounded sectional curvature by using the ones of the spaces of constant extremal sectional curvature Grove et al. (1997); Zhang & Sra (2016; 2018), which makes the study of the constant sectional curvature case critical to the development of fully accelerated algorithms in the general bounded sectional curvature case. Additionally, our work studies g-convexity besides strong g-convexity.

Another related work is the approximate duality gap technique Diakonikolas & Orecchia (2019), which presents a unified view of the analysis of first-order methods for the optimization of convex functions defined in the Euclidean space. It defines a continuous duality gap and, by enforcing a natural invariant, it obtains accelerated continuous dynamics and their discretizations for most classical first-order methods. A derived work Diakonikolas & Orecchia (2018) obtains acceleration in a fundamentally different way from previous acceleration approaches, namely using an approximate implicit Euler method for the discretization of the acceleration dynamics. The convergence analysis of Theorem 2.4 is inspired by these two works. We will see in the sequel that, for our manifolds of interest, g-convexity is related to a model known in the literature as quasar-convexity or weak-quasi-convexity Guminov & Gasnikov (2017); Hinder et al. (2019); Nesterov et al. (2018).
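The basic geometric definitions recalled earlier in this section can be made concrete on the unit sphere (curvature $K = 1$), where the exponential map and its inverse have well-known closed forms. The following is a minimal sketch under that assumption; the function names `exp_map`, `log_map` and `distance` are illustrative and do not come from the paper.

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere: follow the geodesic from x along tangent v."""
    t = np.linalg.norm(v)
    if t == 0.0:
        return x.copy()
    return np.cos(t) * x + np.sin(t) * (v / t)

def log_map(x, y):
    """Inverse exponential map: tangent vector v at x with Exp_x(v) = y.

    Assumes y is not antipodal to x (true inside an open hemisphere around x).
    """
    cos_theta = np.clip(x @ y, -1.0, 1.0)
    theta = np.arccos(cos_theta)        # geodesic distance d(x, y)
    u = y - cos_theta * x               # part of y that is tangent to the sphere at x
    norm_u = np.linalg.norm(u)
    if norm_u == 0.0:
        return np.zeros_like(x)
    return theta * u / norm_u

def distance(x, y):
    """d(x, y) = ||Exp_x^{-1}(y)||, as in the definitions above."""
    return np.linalg.norm(log_map(x, y))

x = np.array([0.0, 0.0, 1.0])
v = np.array([0.3, 0.4, 0.0])           # tangent at x, since <x, v> = 0
y = exp_map(x, v)
```

Here `log_map(x, y)` recovers `v`, and `distance(x, y)` equals the norm of `v`, matching the identity $d(x, y) = \|\operatorname{Exp}_x^{-1}(y)\|$.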

2. ALGORITHM

We study the minimization problem $\min_{x \in \mathcal{M}} F(x)$ with a gradient oracle, for a smooth function $F : \mathcal{M} \to \mathbb{R}$ that is g-convex or strongly g-convex. In this section, $\mathcal{M}$ refers to a manifold that can be $\mathcal{H}$ or $\mathcal{S}$, i.e. the subset $\operatorname{Exp}_{x_0}(\bar{B}(0, R))$ of the hyperbolic space or sphere, for an initial point $x_0$. For simplicity, we do not use subdifferentials, so we assume $F$ is a differentiable function that is defined over the manifold of constant sectional curvature $\mathcal{M}' \overset{\mathrm{def}}{=} \operatorname{Exp}_{x_0}(\bar{B}(0, R'))$, for an $R' > R$, and we avoid writing $F : \mathcal{M}' \to \mathbb{R}$. We defer the proofs of the lemmas and theorems in this and following sections to the supplementary material. We assume without loss of generality that the sectional curvature of $\mathcal{M}$ is $K \in \{1, -1\}$, since for any other value of $K$ and any function $F : \mathcal{M} \to \mathbb{R}$ defined on such a manifold, we can reparametrize $F$ by a rescaling, so it is defined over a manifold of constant sectional curvature $K \in \{1, -1\}$. The parameters $L$, $\mu$ and $R$ are rescaled accordingly as a function of $K$, cf. Remark C.1. We denote the special cosine by $C_K(\cdot)$, which is $\cos(\cdot)$ if $K = 1$ and $\cosh(\cdot)$ if $K = -1$. We define $\mathcal{X} = h(\mathcal{M}) \subseteq \mathcal{B} \subseteq \mathbb{R}^d$.

We use classical geodesic maps for the manifolds that we consider: the Gnomonic projection for $\mathcal{S}$ and the Beltrami-Klein projection for $\mathcal{H}$ Greenberg (1993). They map an open hemisphere and the hyperbolic space of curvature $K \in \{1, -1\}$ to $\mathcal{B} = \mathbb{R}^d$ and $\mathcal{B} = B(0, 1) \subseteq \mathbb{R}^d$, respectively. We will derive our results from the following characterization Greenberg (1993). Let $\tilde{x}, \tilde{y} \in \mathcal{B}$ be two points. Recall that we denote $x = h^{-1}(\tilde{x})$, $y = h^{-1}(\tilde{y}) \in \mathcal{M}$. Then we have that $d(x, y)$, the distance between $x$ and $y$ with the metric of $\mathcal{M}$, satisfies

$$C_K(d(x, y)) = \frac{1 + K\langle \tilde{x}, \tilde{y}\rangle}{\sqrt{1 + K\|\tilde{x}\|^2} \cdot \sqrt{1 + K\|\tilde{y}\|^2}}. \tag{1}$$

Observe that the expression is symmetric with respect to rotations. In particular, the symmetry implies $\mathcal{X}$ is a closed ball of radius $\tilde{R}$, with $C_K(R) = (1 + K\tilde{R}^2)^{-1/2}$.
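As an illustration of characterization (1), the sketch below evaluates the distance from the images $\tilde{x}, \tilde{y}$ and compares it with the distance computed in the usual embedded models (the unit sphere for $K = 1$ and the hyperboloid with the Minkowski product for $K = -1$). The choice of embedded models and all function names are assumptions made for this example, not part of the paper.

```python
import numpy as np

def special_cos_inverse(c, K):
    """Inverse of the special cosine C_K: arccos for K = 1, arccosh for K = -1."""
    return np.arccos(np.clip(c, -1.0, 1.0)) if K == 1 else np.arccosh(max(c, 1.0))

def distance_from_images(x_t, y_t, K):
    """d(x, y) computed from x_t = h(x) and y_t = h(y) via equation (1)."""
    num = 1.0 + K * (x_t @ y_t)
    den = np.sqrt(1.0 + K * (x_t @ x_t)) * np.sqrt(1.0 + K * (y_t @ y_t))
    return special_cos_inverse(num / den, K)

def lift(x_t, K):
    """h^{-1} in embedded coordinates of R^{d+1}: sphere (K = 1) or hyperboloid (K = -1)."""
    return np.append(x_t, 1.0) / np.sqrt(1.0 + K * (x_t @ x_t))

def embedded_distance(p, q, K):
    """Geodesic distance in the embedded model, used here only as a cross-check."""
    if K == 1:
        return special_cos_inverse(p @ q, K)
    # Minkowski product with signature (-, ..., -, +) on the upper hyperboloid sheet.
    return special_cos_inverse(p[-1] * q[-1] - p[:-1] @ q[:-1], K)

x_t = np.array([0.3, -0.2])   # both points lie in the unit ball, so K = -1 is valid too
y_t = np.array([-0.1, 0.5])
for K in (1, -1):
    d_formula = distance_from_images(x_t, y_t, K)
    d_embedded = embedded_distance(lift(x_t, K), lift(y_t, K), K)
    print(K, d_formula, d_embedded)
```

Taking $\tilde{y} = 0$ in `distance_from_images` gives $C_K(d(x_0, x)) = (1 + K\|\tilde{x}\|^2)^{-1/2}$, which is the relation between $R$ and $\tilde{R}$ stated above.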
Consider a point $x \in \mathcal{M}$ and the lower bound provided by the g-convexity assumption when computing $\nabla F(x)$. Dropping the $\mu$ term in case of strong g-convexity, this bound is linear over $T_x\mathcal{M}$. We would like our algorithm to aggregate effectively the lower bounds it computes during the course of the optimization. The deformations of the geometry make this a difficult task, despite the fact that we have a simple description of each individual lower bound. We deal with this problem in the following way: our approach is to obtain a lower bound that is looser by a constant depending on $R$, and that is linear over $\mathcal{B}$. In this way the aggregation becomes easier. Then, we are able to combine this lower bound with decreasing upper bounds in the fashion of other accelerated methods in the Euclidean space Allen Zhu & Orecchia (2017); Diakonikolas & Orecchia (2018; 2019); Nesterov (1983). Alternatively, we can see the approach in this work as the constrained non-convex optimization problem of minimizing the function $f : \mathcal{X} \to \mathbb{R}$, $\tilde{x} \mapsto F(h^{-1}(\tilde{x}))$:

minimize $f(\tilde{x})$, for $\tilde{x} \in \mathcal{X}$.

In the rest of the section, we will focus on the g-convex case. For simplicity, instead of solving the strongly g-convex case directly in an analogous way by finding a lower bound that is quadratic over $\mathcal{B}$, we rely on the reductions of Section 3 to obtain the accelerated algorithm in this case. The following two lemmas show that finding the aforementioned linear lower bound is possible, and that it is defined as a function of $\nabla f(\tilde{x})$. We first gauge the deformations caused by the geodesic map $h$. Distances are deformed, the map $h$ is not conformal and, in spite of it being a geodesic map, the image of the geodesic $\operatorname{Exp}_x(\lambda \nabla F(x))$ is not mapped into the image of the geodesic $\tilde{x} + \tilde{\lambda}\nabla f(\tilde{x})$, i.e. the direction of the gradient changes. We are able to find the linear lower bound after bounding these deformations.

Lemma 2.1. Let $x, y \in \mathcal{M}$ be two different points, and in part b) different from $x_0$.
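Let $\tilde{\alpha}$ be the angle $\angle \tilde{x}_0\tilde{x}\tilde{y}$, formed by the vectors $\tilde{x}_0 - \tilde{x}$ and $\tilde{y} - \tilde{x}$. Let $\alpha$ be the corresponding angle between the vectors $\operatorname{Exp}_x^{-1}(x_0)$ and $\operatorname{Exp}_x^{-1}(y)$. Assume without loss of generality that $\tilde{x} \in \operatorname{span}\{\tilde{e}_1\}$ and $\nabla f(\tilde{x}) \in \operatorname{span}\{\tilde{e}_1, \tilde{e}_2\}$ for the canonical orthonormal basis $\{\tilde{e}_i\}_{i=1}^d$. Let $e_i \in T_x\mathcal{M}$ be the unit vector such that $h$ maps the image of the geodesic $\operatorname{Exp}_x(\lambda e_i)$ to the image of the geodesic $\tilde{x} + \tilde{\lambda}\tilde{e}_i$, for $i = 1, \dots, d$, and $\lambda, \tilde{\lambda} \ge 0$. Then, the following holds.

a) Distance deformation:
$$K C_K^2(R) \le K \frac{d(x, y)}{\|\tilde{x} - \tilde{y}\|} \le K.$$

b) Angle deformation:
$$\sin(\alpha) = \sin(\tilde{\alpha}) \sqrt{\frac{1 + K\|\tilde{x}\|^2}{1 + K\|\tilde{x}\|^2 \sin^2(\tilde{\alpha})}}, \qquad \cos(\alpha) = \cos(\tilde{\alpha}) \sqrt{\frac{1}{1 + K\|\tilde{x}\|^2 \sin^2(\tilde{\alpha})}}.$$

c) Gradient deformation:
$$\nabla F(x) = (1 + K\|\tilde{x}\|^2)\, \nabla f(\tilde{x})_1\, e_1 + \sqrt{1 + K\|\tilde{x}\|^2}\, \nabla f(\tilde{x})_2\, e_2,$$
and $e_i \perp e_j$ for $i \ne j$. And if $v \in T_x\mathcal{M}$ is a vector normal to $\nabla F(x)$, then $\tilde{v}$ is normal to $\nabla f(\tilde{x})$.

The following lemma uses the deformations described in the previous lemma to obtain the linear lower bound on the function, given a gradient at a point $\tilde{x}$. Note that Lemma 2.1.c implies that we have $\langle \nabla f(\tilde{x}), \tilde{y} - \tilde{x}\rangle = 0$ if and only if $\langle \nabla F(x), y - x\rangle = 0$. In the proof we lower bound, generally, linear functions defined on $T_x\mathcal{M}$ by linear functions in the Euclidean space $\mathcal{B}$. This generality allows us to obtain a result with constants that only depend on $R$.

Lemma 2.2. Let $F : \mathcal{M} \to \mathbb{R}$ be a differentiable function and let $f = F \circ h^{-1}$. Then, there are constants $\gamma_n, \gamma_p \in (0, 1]$ depending on $R$ such that for all $x, y \in \mathcal{M}$ satisfying $\langle \nabla f(\tilde{x}), \tilde{y} - \tilde{x}\rangle \ne 0$ we have:
$$\gamma_p \le \frac{\langle \nabla F(x), y - x\rangle}{\langle \nabla f(\tilde{x}), \tilde{y} - \tilde{x}\rangle} \le \frac{1}{\gamma_n}.$$
In particular, if $F$ is g-convex we have:
$$\begin{aligned} f(\tilde{x}) + \frac{1}{\gamma_n}\langle \nabla f(\tilde{x}), \tilde{y} - \tilde{x}\rangle &\le f(\tilde{y}) \quad \text{if } \langle \nabla f(\tilde{x}), \tilde{y} - \tilde{x}\rangle \le 0, \\ f(\tilde{x}) + \gamma_p\langle \nabla f(\tilde{x}), \tilde{y} - \tilde{x}\rangle &\le f(\tilde{y}) \quad \text{if } \langle \nabla f(\tilde{x}), \tilde{y} - \tilde{x}\rangle \ge 0. \end{aligned} \tag{3}$$

The two inequalities in (3) show the linear lower bound. Only the first one is needed to bound $f(\tilde{x}^*) = F(x^*)$. The first inequality applied to $\tilde{y} = \tilde{x}^*$ defines a model known in the literature as quasar-convexity or weak-quasi-convexity Guminov & Gasnikov (2017); Hinder et al. (2019); Nesterov et al.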
(2018), for which accelerated algorithms exist in the unconstrained case, provided smoothness is also satisfied. However, to the best of our knowledge, there is no known algorithm for solving the constrained case in an accelerated way. The condition in (3) is, trivially, a relaxation of convexity that is stronger than quasar-convexity. We will make use of (3) in order to obtain acceleration in the constrained setting. This is of independent interest. Recall that we need the constraint to guarantee bounded deformation due to the geometry. We also require smoothness of $f$. The following lemma shows that $f$ is as smooth as $F$ up to a constant depending on $R$.

Lemma 2.3. Let $F : \mathcal{M} \to \mathbb{R}$ be an $L$-smooth function and $f = F \circ h^{-1}$. Assume there is a point $x^* \in \mathcal{M}$ such that $\nabla F(x^*) = 0$. Then $f$ is $O(L)$-smooth.

Using the approximate duality gap technique Diakonikolas & Orecchia (2019) we obtain accelerated continuous dynamics for the optimization of the function $f$. Then we adapt AXGD to obtain an accelerated discretization. AXGD Diakonikolas & Orecchia (2018) is a method that is based on implicit Euler discretization of continuous accelerated dynamics and is fundamentally different from AGD and techniques such as Linear Coupling Allen Zhu & Orecchia (2017) or Nesterov's estimate sequence Nesterov (1983). The latter techniques use a balancing gradient step at each iteration, and our use of a looser lower bound makes it hard to guarantee that this gradient step stays within the constraints. We state the accelerated theorem and provide a sketch of the proof in Section 2.1.

Theorem 2.4. Let $Q \subseteq \mathbb{R}^d$ be a convex set of diameter $2R$. Let $f : Q \to \mathbb{R}$ be an $L$-smooth function satisfying (3) with constants $\gamma_n, \gamma_p \in (0, 1]$. Assume there is a point $\tilde{x}^* \in Q$ such that $\nabla f(\tilde{x}^*) = 0$. Then, we can obtain an $\varepsilon$-minimizer of $f$ using $\widetilde{O}\big(\sqrt{L/(\gamma_n^2 \gamma_p \varepsilon)}\big)$ queries to the gradient oracle of $f$.

Finally, we have Riemannian acceleration as a direct consequence of Theorem 2.4, Lemma 2.2 and Lemma 2.3.
Theorem 2.5 (g-Convex Acceleration). Let $F : \mathcal{M} \to \mathbb{R}$ be an $L$-smooth and g-convex function and assume there is a point $x^* \in \mathcal{M}$ satisfying $\nabla F(x^*) = 0$. Algorithm 1 computes a point $x_t \in \mathcal{M}$ satisfying $F(x_t) - F(x^*) \le \varepsilon$ using $\widetilde{O}(\sqrt{L/\varepsilon})$ queries to the gradient oracle.

We observe that if there is a geodesic map mapping a manifold into a convex subset of the Euclidean space, then the manifold must necessarily have constant sectional curvature, cf. Beltrami's Theorem Busemann & Phadke (1984); Kreyszig (1991). This precludes a straightforward generalization of our method to the case of non-constant bounded sectional curvature.
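The distance deformation bound of Lemma 2.1 a), one of the ingredients behind Theorem 2.5, can be sanity-checked numerically. The sketch below samples pairs of points in $\mathcal{X}$, computes $d(x, y)$ through (1), and compares the ratio $d(x, y)/\|\tilde{x} - \tilde{y}\|$ with the bounds of the lemma. It is only an empirical check under illustrative choices (dimension, sample size, $R = 1$, and the helper names are all assumptions), not a substitute for the proof.

```python
import numpy as np

def image_distance(x_t, y_t, K):
    """Manifold distance d(x, y) from the images under h, using equation (1)."""
    c = (1.0 + K * (x_t @ y_t)) / np.sqrt((1.0 + K * (x_t @ x_t)) * (1.0 + K * (y_t @ y_t)))
    return np.arccos(min(c, 1.0)) if K == 1 else np.arccosh(max(c, 1.0))

def deformation_ratios(K, R, dim=3, samples=2000, seed=0):
    """Sample pairs in the ball X of radius R_tilde; return d(x, y) / ||x_t - y_t||."""
    rng = np.random.default_rng(seed)
    R_tilde = np.tan(R) if K == 1 else np.tanh(R)   # solves C_K(R) = (1 + K R_tilde^2)^(-1/2)
    ratios = []
    while len(ratios) < samples:
        x_t, y_t = rng.uniform(-R_tilde, R_tilde, size=(2, dim))
        gap = np.linalg.norm(x_t - y_t)
        if max(np.linalg.norm(x_t), np.linalg.norm(y_t)) > R_tilde or gap < 1e-3:
            continue  # stay inside the ball and skip numerically ill-conditioned pairs
        ratios.append(image_distance(x_t, y_t, K) / gap)
    return np.array(ratios)

def lemma_bounds(K, R):
    """(lower, upper) bounds on d / ||x_t - y_t|| implied by Lemma 2.1 a)."""
    c2 = (np.cos(R) if K == 1 else np.cosh(R)) ** 2
    return (c2, 1.0) if K == 1 else (1.0, c2)

for K in (1, -1):
    r = deformation_ratios(K, R=1.0)
    lo, hi = lemma_bounds(K, R=1.0)
    print(f"K={K}: observed ratios in [{r.min():.4f}, {r.max():.4f}], bounds [{lo:.4f}, {hi:.4f}]")
```

For $K = 1$ the geodesic map stretches distances, so the ratio lies in $[\cos^2(R), 1]$; for $K = -1$ it shrinks them, so the ratio lies in $[1, \cosh^2(R)]$. Both cases are the single inequality of the lemma after multiplying by $K$.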

Algorithm 1 Accelerated g-Convex Minimization

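Input: Smooth and g-convex function $F : \mathcal{M} \to \mathbb{R}$, for $\mathcal{M} = \mathcal{H}$ or $\mathcal{M} = \mathcal{S}$. Initial point $x_0$. Constants $L$, $\gamma_p$, $\gamma_n$. Geodesic map $h$ satisfying (1) and $h(x_0) = 0$. Bound on the distance to a minimum $R \ge d(x_0, x^*)$. Accuracy $\varepsilon$ and number of iterations $t$.
1: $\mathcal{X} \overset{\mathrm{def}}{=} h(\operatorname{Exp}_{x_0}(\bar{B}(0, R))) \subseteq \mathcal{B}$; $f \overset{\mathrm{def}}{=} F \circ h^{-1}$ and $\psi(\tilde{x}) \overset{\mathrm{def}}{=} \frac{1}{2}\|\tilde{x}\|^2$
2: $\tilde{z}_0 \leftarrow \nabla\psi(\tilde{x}_0)$; $A_0 \leftarrow 0$
3: for $i$ from $0$ to $t - 1$ do
4:   $a_{i+1} \leftarrow (i + 1)\gamma_n^2 \gamma_p / (2L)$
5:   $A_{i+1} \leftarrow A_i + a_{i+1}$
6:   $\lambda \leftarrow \text{BinaryLineSearch}(\tilde{x}_i, \tilde{z}_i, f, \mathcal{X}, a_{i+1}, A_i, \varepsilon, L, \gamma_n, \gamma_p)$ (cf. Algorithm 2 in Appendix A)
7:   $\tilde{\chi}_i \leftarrow (1 - \lambda)\tilde{x}_i + \lambda\nabla\psi^*(\tilde{z}_i)$
8:   $\tilde{\zeta}_i \leftarrow \tilde{z}_i - (a_{i+1}/\gamma_n)\nabla f(\tilde{\chi}_i)$
9:   $\tilde{x}_{i+1} \leftarrow (1 - \lambda)\tilde{x}_i + \lambda\nabla\psi^*(\tilde{\zeta}_i)$   (where $\nabla\psi^*(\tilde{p}) = \arg\min_{\tilde{z} \in \mathcal{X}} \{\|\tilde{z} - \tilde{p}\|\} = \Pi_{\mathcal{X}}(\tilde{p})$)
10:   $\tilde{z}_{i+1} \leftarrow \tilde{z}_i - (a_{i+1}/\gamma_n)\nabla f(\tilde{x}_{i+1})$
11: end for
12: return $x_t$.

2.1 SKETCH OF THE PROOF OF THEOREM 2.4

Inspired by the approximate duality gap technique Diakonikolas & Orecchia (2019), let $\alpha_t$ be an increasing function of time $t$, and denote $A_t = \int_{t_0}^{t} d\alpha_\tau = \int_{t_0}^{t} \dot{\alpha}_\tau\, d\tau$. We define a continuous method that keeps a solution $\tilde{x}_t$, along with a differentiable upper bound $U_t$ on $f(\tilde{x}_t)$ and a lower bound $L_t$ on $f(\tilde{x}^*)$. In our case $f$ is differentiable, so we can just take $U_t = f(\tilde{x}_t)$. The lower bound comes from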

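$$f(\tilde{x}^*) \ge \frac{\int_{t_0}^{t} f(\tilde{x}_\tau)\, d\alpha_\tau}{A_t} + \frac{\int_{t_0}^{t} \frac{1}{\gamma_n}\langle \nabla f(\tilde{x}_\tau), \tilde{x}^* - \tilde{x}_\tau\rangle\, d\alpha_\tau}{A_t}, \tag{4}$$

after applying some desirable modifications, like regularization with a 1-strongly convex function $\psi$ and removing the unknown $\tilde{x}^*$ by taking a minimum over $\mathcal{X}$. Note that (4) comes from averaging (3) for $\tilde{y} = \tilde{x}^*$. Then, if we define the gap $G_t = U_t - L_t$ and design a method that forces $\alpha_t G_t$ to be non-increasing, we can deduce $f(\tilde{x}_t) - f(\tilde{x}^*) \le G_t \le \alpha_{t_0} G_{t_0}/\alpha_t$. By forcing $\frac{d}{dt}(\alpha_t G_t) = 0$, we naturally obtain the following continuous dynamics, where $\tilde{z}_t$ is a mirror point and $\psi^*$ is the Fenchel dual of $\psi$, cf. Definition A.2.

$$\dot{\tilde{z}}_t = -\frac{1}{\gamma_n}\dot{\alpha}_t \nabla f(\tilde{x}_t); \qquad \dot{\tilde{x}}_t = \frac{1}{\gamma_n}\dot{\alpha}_t \frac{\nabla\psi^*(\tilde{z}_t) - \tilde{x}_t}{\alpha_t}; \qquad \tilde{z}_{t_0} = \nabla\psi(\tilde{x}_{t_0}),\ \tilde{x}_{t_0} \in \mathcal{X}.$$

We note that, except for the constant $\gamma_n$, these dynamics match the accelerated dynamics used in the optimization of convex functions Diakonikolas & Orecchia (2019; 2018); Krichene et al. (2015). The AXGD algorithm Diakonikolas & Orecchia (2018), designed for the accelerated optimization of convex functions, discretizes the latter dynamics following an approximate implementation of implicit Euler discretization. This has the advantage of not needing a gradient step per iteration to compensate for some positive discretization error. Note that in our case we must use (3) instead of convexity for a discretization. We are able to obtain the following discretization coming from an approximate implicit Euler discretization:

$$\begin{aligned} \tilde{\chi}_i &= \frac{\hat{\gamma}_i A_i}{A_i\hat{\gamma}_i + a_{i+1}/\gamma_n}\,\tilde{x}_i + \frac{a_{i+1}/\gamma_n}{A_i\hat{\gamma}_i + a_{i+1}/\gamma_n}\,\nabla\psi^*(\tilde{z}_i); & \tilde{\zeta}_i &= \tilde{z}_i - \frac{a_{i+1}}{\gamma_n}\nabla f(\tilde{\chi}_i) \\ \tilde{x}_{i+1} &= \frac{\hat{\gamma}_i A_i}{A_i\hat{\gamma}_i + a_{i+1}/\gamma_n}\,\tilde{x}_i + \frac{a_{i+1}/\gamma_n}{A_i\hat{\gamma}_i + a_{i+1}/\gamma_n}\,\nabla\psi^*(\tilde{\zeta}_i); & \tilde{z}_{i+1} &= \tilde{z}_i - \frac{a_{i+1}}{\gamma_n}\nabla f(\tilde{x}_{i+1}) \end{aligned} \tag{6}$$

where $\hat{\gamma}_i \in [\gamma_p, 1/\gamma_n]$ is a parameter, $\tilde{x}_0 \in \mathcal{X}$ is an arbitrary point, $\tilde{z}_0 = \nabla\psi(\tilde{x}_0)$, and now $\alpha_t$ is a discrete measure and $\dot{\alpha}_t$ is a weighted sum of Dirac delta functions $\dot{\alpha}_t = \sum_{i=1}^{\infty} a_i \delta(t - (t_0 + i - 1))$. Compare (6) with the discretization in AXGD Diakonikolas & Orecchia (2018), which is equal to our discretization but with no $\gamma_n$ or $\hat{\gamma}_i$.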
Or equivalently with $\hat{\gamma}_i = 1/\gamma_n$ and with no $\gamma_n$ for the mirror descent updates of $\tilde{\zeta}_i$ and $\tilde{z}_{i+1}$. However, not having convexity, in order to have per-iteration discretization error less than $\hat{\varepsilon}/A_T$, we require $\hat{\gamma}_i$ to be such that $\tilde{x}_{i+1}$ satisfies

$$f(\tilde{x}_{i+1}) - f(\tilde{x}_i) \le \hat{\gamma}_i\langle \nabla f(\tilde{x}_{i+1}), \tilde{x}_{i+1} - \tilde{x}_i\rangle + \hat{\varepsilon}, \tag{7}$$

where $\hat{\varepsilon}$ is chosen so that the accumulated discretization error is $< \varepsilon/2$, after having performed the steps necessary to obtain an $\varepsilon/2$-minimizer. We would like to use (3) to find such a $\hat{\gamma}_i$, but we need to take into account that we only know $\tilde{x}_{i+1}$ a posteriori. Indeed, using (3) we conclude that, when setting $\hat{\gamma}_i$ to $1/\gamma_n$ or $\gamma_p$, either (7) is satisfied or there is a point $\hat{\gamma}_i \in (\gamma_p, 1/\gamma_n)$ for which $\langle \nabla f(\tilde{x}_{i+1}), \tilde{x}_{i+1} - \tilde{x}_i\rangle = 0$, which satisfies the equation for $\hat{\varepsilon} = 0$. Then, using smoothness of $f$, existence of $\tilde{x}^*$ (that satisfies $\nabla f(\tilde{x}^*) = 0$), and boundedness of $\mathcal{X}$, we can guarantee that a binary search finds a point satisfying (7) in $O(\log(L i/(\gamma_n \hat{\varepsilon})))$ iterations. Each iteration of the binary search requires running (6), that is, one step of the discretization. Computing the final discretization error, we obtain acceleration after choosing appropriate learning rates $a_i$. Algorithm 1 contains the pseudocode of this algorithm along with the reduction of the problem from minimizing $F$ to minimizing $f$. We chose $\psi(\tilde{x}) \overset{\mathrm{def}}{=} \frac{1}{2}\|\tilde{x}\|^2$ as our strongly convex regularizer.
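To make the structure of one step of discretization (6) concrete, the following sketch implements it for the regularizer $\psi(\tilde{x}) = \frac{1}{2}\|\tilde{x}\|^2$, for which $\nabla\psi^*$ is the Euclidean projection onto the ball $\mathcal{X}$. It is a simplified illustration, not the full Algorithm 1: the parameter $\hat{\gamma}_i$ is passed in as a fixed value instead of being found by the binary search of Appendix A, and the function names and the toy objective are assumptions made for this example.

```python
import numpy as np

def project_ball(p, radius):
    """Euclidean projection onto the closed ball of given radius, i.e. grad psi^*(p)."""
    norm = np.linalg.norm(p)
    return p if norm <= radius else p * (radius / norm)

def discretization_step(x, z, grad_f, a_next, A, gamma_hat, gamma_n, radius):
    """One step of (6): maps the current pair (x_i, z_i) to (x_{i+1}, z_{i+1})."""
    step = a_next / gamma_n
    lam = step / (A * gamma_hat + step)            # weight given to the mirror point
    chi = (1.0 - lam) * x + lam * project_ball(z, radius)
    zeta = z - step * grad_f(chi)
    x_next = (1.0 - lam) * x + lam * project_ball(zeta, radius)
    z_next = z - step * grad_f(x_next)
    return x_next, z_next

def run(grad_f, x0, L, gamma_n, gamma_p, gamma_hat, radius, iterations):
    """Iterate the step with a_{i+1} = (i + 1) gamma_n^2 gamma_p / (2 L), as in Algorithm 1."""
    x, z, A = x0.copy(), x0.copy(), 0.0            # z_0 = grad psi(x_0) = x_0
    for i in range(iterations):
        a_next = (i + 1) * gamma_n**2 * gamma_p / (2.0 * L)
        x, z = discretization_step(x, z, grad_f, a_next, A, gamma_hat, gamma_n, radius)
        A += a_next
    return x

# Toy convex instance (an assumption for illustration): for convex f, condition (3)
# holds with gamma_n = gamma_p = 1 and gamma_hat = 1 satisfies (7) with zero error.
center = np.array([0.3, -0.4])
f = lambda x: 0.5 * np.sum((x - center) ** 2)
grad_f = lambda x: x - center
x_final = run(grad_f, np.zeros(2), L=1.0, gamma_n=1.0, gamma_p=1.0,
              gamma_hat=1.0, radius=1.0, iterations=200)
print(f(x_final))
```

With $\gamma_n = \gamma_p = \hat{\gamma}_i = 1$ the weight `lam` becomes $a_{i+1}/A_{i+1}$, so the step coincides with the AXGD update, in line with the comparison made in the text. In the Riemannian setting `grad_f` would be the gradient of $f = F \circ h^{-1}$ and $\hat{\gamma}_i$ would come from the binary search.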

3. REDUCTIONS

The construction of reductions proves to be very useful in order to facilitate the design of algorithms in different settings. Moreover, reductions are a helpful tool to infer new lower bounds without extra ad hoc analysis. We present two reductions. We will see in Corollary 3.2 and Example 3.4 that one can obtain fully accelerated methods to minimize smooth and strongly g-convex functions from methods for smooth and g-convex functions and vice versa. These are generalizations of some reductions designed to work in the Euclidean space Allen Zhu & Hazan (2016); Allen Zhu & Orecchia (2017). The reduction to strongly g-convex functions takes into account the effect of the deformation of the space on the strong convexity of the function $F_y(x) = d(x, y)^2/2$, for $x, y \in \mathcal{M}$. The reduction to g-convexity requires the rate of the algorithm that applies to g-convex functions to be proportional to the distance between the initial point and the optimum $d(x_0, x^*)$. The proofs of the statements in this section can be found in the supplementary material. We will use $\operatorname{Time}_{ns}(\cdot)$ and $\operatorname{Time}(\cdot)$ to denote the time that the algorithms $\mathcal{A}_{ns}$ and $\mathcal{A}$ below require, respectively, to perform the tasks we define below.

Theorem 3.1. Let $\mathcal{M}$ be a Riemannian manifold, let $F : \mathcal{M} \to \mathbb{R}$ be an $L$-smooth and $\mu$-strongly g-convex function, and let $x^*$ be its minimizer. Let $x_0$ be a starting point such that $d(x_0, x^*) \le R$. Suppose we have an algorithm $\mathcal{A}_{ns}$ to minimize $F$, such that in time $T = \operatorname{Time}_{ns}(L, \mu, R)$ it produces a point $\hat{x}_T$ satisfying $F(\hat{x}_T) - F(x^*) \le \mu \cdot d(x_0, x^*)^2/4$. Then we can compute an $\varepsilon$-minimizer of $F$ in time $O(\operatorname{Time}_{ns}(L, \mu, R)\log(R^2\mu/\varepsilon))$.

Theorem 3.1 implies that if we forget about the strong g-convexity of a function and treat it as if it were just g-convex, we can run in stages an algorithm designed for optimizing g-convex functions. The fact that the function is strongly g-convex is only used between stages, as the following corollary shows by making use of Algorithm 1.
Corollary 3.2. We can compute an $\varepsilon$-minimizer of an $L$-smooth and $\mu$-strongly g-convex function $F : \mathcal{M} \to \mathbb{R}$ in $O^*(\sqrt{L/\mu}\,\log(\mu/\varepsilon))$ queries to the gradient oracle, where $\mathcal{M} = \mathcal{S}$ or $\mathcal{M} = \mathcal{H}$.

We note that in the strongly convex case, by decreasing the function value by a factor we can guarantee we decrease the distance to $x^*$ by another factor, so we can periodically recenter the geodesic map to reduce the constants produced by the deformations of the geometry, see the proof of Corollary 3.2. Finally, we show the reverse reduction.

Theorem 3.3. Let $\mathcal{M}$ be a Riemannian manifold of bounded sectional curvature, let $F : \mathcal{M} \to \mathbb{R}$ be an $L$-smooth and g-convex function, and assume there is a point $x^* \in \mathcal{M}$ such that $\nabla F(x^*) = 0$. Let $x_0$ be a starting point such that $d(x_0, x^*) \le R$ and let $\Delta$ satisfy $F(x_0) - F(x^*) \le \Delta$. Assume we have an algorithm $\mathcal{A}$ that, given an $L$-smooth and $\mu$-strongly g-convex function $\hat{F} : \mathcal{M} \to \mathbb{R}$ with minimizer in $\operatorname{Exp}_{x_0}(\bar{B}(0, R))$ and any initial point $\hat{x}_0 \in \mathcal{M}$, produces a point $\hat{x} \in \operatorname{Exp}_{x_0}(\bar{B}(0, R))$ in time $\hat{T} = \operatorname{Time}(L, \mu, \mathcal{M}, R)$ satisfying $\hat{F}(\hat{x}) - \min_{x \in \mathcal{M}} \hat{F}(x) \le (\hat{F}(\hat{x}_0) - \min_{x \in \mathcal{M}} \hat{F}(x))/4$. Let $T = \lceil \log_2(\Delta/\varepsilon)/2 \rceil + 1$. Then, we can compute an $\varepsilon$-minimizer in time

$$\sum_{t=0}^{T-1} \operatorname{Time}\big(L + 2^{-t}\Delta K_R^-/R^2,\ 2^{-t}\Delta K_R^+/R^2,\ \mathcal{M},\ R\big),$$

where $K_R^+$ and $K_R^-$ are constants that depend on $R$ and the bounds on the sectional curvature of $\mathcal{M}$.

Example 3.4. Applying the reduction of Theorem 3.3 to the algorithm in Corollary 3.2, we can optimize $L$-smooth and g-convex functions defined on $\mathcal{H}$ or $\mathcal{S}$ with a gradient oracle complexity of $\widetilde{O}(L/\sqrt{\varepsilon})$.

Note that this reduction cannot be applied to the locally accelerated algorithm in (Zhang & Sra, 2018) that we discussed in the related work section. The reduction runs in stages by adding decreasing $\mu_i$-strongly convex regularizers until we reach $\mu_i = O(\varepsilon)$. The local assumption required by the algorithm in (Zhang & Sra, 2018) on the closeness to the minimum cannot be guaranteed.
In (Ahn & Sra, 2020) , the authors give an unconstrained global algorithm whose rates are strictly better than RGD. The reduction could be applied to a constrained version of this algorithm to obtain a method for smooth and g-convex functions defined on manifolds of bounded sectional curvature and whose rates are strictly better than RGD.
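The stage-wise argument behind Theorem 3.1 can be written out in a few lines. The derivation below is a sketch of the standard restart argument, not the proof from the supplementary material. It assumes that $\nabla F(x^*) = 0$ at the minimizer and that $\operatorname{Time}_{ns}(L, \mu, R)$ does not increase when $R$ decreases; the iterates $x^{(k)}$ denote the output of stage $k$, a notation introduced here for illustration.

```latex
% One stage: strong g-convexity at x^* combined with the guarantee of A_ns.
\begin{align*}
\frac{\mu}{2}\, d(\hat{x}_T, x^*)^2
  &\le F(\hat{x}_T) - F(x^*)
     && \text{($\mu$-strong g-convexity at $x^*$, using $\nabla F(x^*) = 0$)} \\
  &\le \frac{\mu}{4}\, d(x_0, x^*)^2
     && \text{(guarantee of $\mathcal{A}_{ns}$)},
\end{align*}
% Hence each stage halves the squared distance to the minimizer:
\[
  d(\hat{x}_T, x^*)^2 \le \tfrac{1}{2}\, d(x_0, x^*)^2 .
\]
% Restarting A_ns from the previous output, with x^{(0)} = x_0 and d(x_0, x^*) \le R,
% stage k starts at squared distance at most R^2 / 2^{k-1}, so after k stages
\[
  F(x^{(k)}) - F(x^*) \le \frac{\mu}{4} \cdot \frac{R^2}{2^{k-1}} = \frac{\mu R^2}{2^{k+1}},
\]
% which is at most \varepsilon as soon as
\[
  k \ge \log_2\!\left(\frac{\mu R^2}{\varepsilon}\right) - 1 .
\]
```

This accounts for the $\log(R^2\mu/\varepsilon)$ factor in Theorem 3.1: strong g-convexity is used only to convert the function-value guarantee of each stage into a distance guarantee for the next one.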

4. CONCLUSION

In this work we proposed a first-order method with the same rates as AGD, for the optimization of smooth and g-convex or strongly g-convex functions defined on a manifold other than the Euclidean space, up to constants and log factors. We focused on the hyperbolic and spherical spaces, which have constant sectional curvature. The study of geometric properties for the constant sectional curvature case can usually be employed to conclude that a space of bounded sectional curvature satisfies a property that is in between the ones for the cases of constant extremal sectional curvature. Several previous algorithms have been developed for the optimization in Riemannian manifolds of bounded sectional curvature by utilizing this philosophy, for instance Ahn & Sra (2020); Ferreira et al. (2019); Wang et al. (2015); Zhang & Sra (2016; 2018). In future work, we will attempt to use the techniques and insights developed in this work to give an algorithm with the same rates as AGD for manifolds of bounded sectional curvature.

The key technique of our algorithm is the effective lower bound aggregation. Indeed, lower bound aggregation is the main hurdle to obtaining accelerated first-order methods defined on Riemannian manifolds. Whereas the process of obtaining effective decreasing upper bounds on the function works similarly as in the Euclidean space (the same approach of locally minimizing the upper bound given by the smoothness assumption is used), obtaining adequate lower bounds proves to be a difficult task. We usually want a simple lower bound such that it, or a regularized version of it, can be easily optimized globally. We also want the lower bound to combine the knowledge that g-convexity or strong g-convexity provides for all the queried points, commonly as an average. These Riemannian convexity assumptions provide simple lower bounds, namely linear or quadratic, but each with respect to each of the tangent spaces of the queried points only.
The deformations of the space complicate the aggregation of the lower bounds. Our work deals with this problem by finding appropriate lower bounds via the use of a geodesic map and takes into account the deformations incurred to derive a fully accelerated algorithm. We also needed to deal with other technical problems. Firstly, we needed a lower bound on the whole function and not only on F (x * ), for which we had to construct two different linear lower bounds, obtaining a relaxation of convexity. Secondly, we had to use an implicit discretization of an accelerated continuous dynamics, since at least the vanilla application of usual approaches like Linear Coupling Allen Zhu & Orecchia (2017) or Nesterov's estimate sequence Nesterov (1983) , that can be seen as a forward Euler discretization of the accelerated dynamics combined with a balancing gradient step Diakonikolas & Orecchia (2019) , did not work in our constrained case. We interpret that the difficulty arises from trying to keep the gradient step inside the constraints while being able to compensate for a lower bound that is looser by a constant factor.

