ACCELERATED RIEMANNIAN OPTIMIZATION: HANDLING CONSTRAINTS TO BOUND GEOMETRIC PENALTIES

Abstract

We propose a globally-accelerated, first-order method for the optimization of smooth and (strongly or not) geodesically-convex functions in Hadamard manifolds. Our algorithm enjoys the same convergence rates as Nesterov's accelerated gradient descent, up to a multiplicative geometric penalty and log factors. Crucially, we can enforce our method to stay within a compact set we define. Prior fully accelerated works resort to assuming that the iterates of their algorithms stay in some pre-specified compact set, except for two previous methods, whose applicability is limited to local optimization and to spaces of constant curvature, respectively. Achieving global and general Riemannian acceleration without simply assuming that the iterates stay in the feasible set was asked as an open question in (Kim & Yang, 2022), which we solve for Hadamard manifolds. In our solution, we show that we can use a linearly convergent algorithm for constrained strongly g-convex smooth problems to implement a Riemannian inexact proximal point operator, which we use as a subroutine and which is of independent interest.

1. INTRODUCTION

Riemannian optimization concerns the optimization of a function defined over a Riemannian manifold. It is motivated by constrained problems that can be naturally expressed on Riemannian manifolds, which allows one to exploit the geometric structure of the problem, effectively transforming it into an unconstrained one. Moreover, there are problems that are not convex in the Euclidean setting but that, when posed as problems over a manifold with the right metric, are convex when restricted to every geodesic, and this allows for fast optimization (Cruz Neto et al.,



Among the Riemannian methods developed are projection-free (Weber & Sra, 2017; 2019), saddle-point-escaping (Criscitiello & Boumal, 2019; Sun et al., 2019; Zhou et al., 2019; Criscitiello & Boumal, 2020), stochastic (Hosseini & Sra, 2017; Khuzani & Li, 2017; Tripuraneni et al., 2018), variance-reduced (Sato et al., 2017; 2019; Zhang et al., 2016), and min-max methods (Zhang et al., 2022), among others. Riemannian generalizations of accelerated convex optimization are appealing due to their better convergence rates with respect to unaccelerated methods, especially in ill-conditioned problems. Acceleration in Euclidean convex optimization is a concept that has been broadly explored and has provided many different fast algorithms. A paradigmatic example is Nesterov's Accelerated Gradient Descent (AGD), cf. (Nesterov, 1983), which is considered the first general accelerated method; the conjugate gradient method can be seen as an accelerated predecessor of more limited scope (Martínez-Rubio, 2021). There have been recent efforts to better understand this phenomenon in the Euclidean case (Allen Zhu & Orecchia, 2017; Su et al., 2016; Drori & Teboulle, 2014; Wibisono et al., 2016; Diakonikolas & Orecchia, 2019; Joulani et al., 2020), which have yielded fruitful techniques for the general development of methods and analyses. These techniques have allowed for a considerable number of new results going beyond the standard oracle model, beyond convexity, or beyond first-order methods, in a wide variety of settings (Tseng, 2008; Beck & Teboulle, 2009; Wang et al., 2016a; Allen Zhu & Orecchia, 2015; Allen-Zhu, 2017; 2018; Carmon et al., 2017; Diakonikolas & Orecchia, 2018; Hinder et al., 2019; Gasnikov et al., 2019; Ivanova et al., 2021; Kamzolov & Gasnikov, 2020; Criado et al., 2021), among many others. There have been some efforts to achieve acceleration for Riemannian algorithms as generalizations of AGD, cf. Section 1.3.
These works try to answer the following fundamental question: can a Riemannian first-order method enjoy the same rates of convergence as Euclidean AGD? The question is posed under (possibly strong) geodesic convexity and smoothness of the function to be optimized. Due to the lower bound in (Criscitiello & Boumal, 2021), we know the optimization must take place under bounded sectional curvature of the Riemannian manifold, and we might have to optimize over a bounded domain.

Main result

In this work, we study the question above in the case of Hadamard manifolds ℳ of bounded sectional curvature, and we provide an instance of our framework for a wide class of Hadamard manifolds. For a differentiable f : ℳ → ℝ with a global minimizer x*, let x_0 ∈ ℳ be an initial point and R an upper bound on the distance d(x_0, x*). If f is L-smooth and (possibly μ-strongly) g-convex in a closed ball of center x* and radius O(R), our algorithms obtain the same rates of convergence as AGD, up to logarithmic factors and a multiplicative geometric penalty, cf. Theorem 2.4. See Table 1 for a succinct comparison among accelerated algorithms and their rates. This algorithm is a consequence of the general framework we design, the general accelerated scheme Riemacon: given a not necessarily accelerated, linearly convergent subroutine for strongly g-convex smooth problems constrained to a geodesically convex set 𝒳, we design first-order algorithms that enjoy the same rates as AGD when approximating min_{x∈𝒳} f(x), up to logarithmic factors and a geometric penalty factor, where f : 𝒩 ⊂ ℳ → ℝ is a differentiable function that is smooth and g-convex (or strongly g-convex) in 𝒳 ⊂ 𝒩, cf. Theorem 2.2. Importantly, our algorithms obtain acceleration without an undesirable assumption that most previous works had to make: that the iterates of the algorithm stay inside a pre-specified compact set, without any mechanism for enforcing or guaranteeing this condition. To the best of our knowledge, only two previous methods are able to deal with some form of constraints, and they apply to the limited settings of local optimization (Criscitiello & Boumal, 2021) and constant-sectional-curvature manifolds (Martínez-Rubio, 2021), respectively. Techniques in the rest of the papers resort to assuming that the iterates of their algorithms are always feasible.
Removing this condition in general, global, and fully accelerated methods was posed as an open question in (Kim & Yang, 2022), which we solve for the case of Hadamard manifolds. The difficulty of constraining the problem in order to bound geometric penalties, as well as the necessity of doing so in order to provide full optimization guarantees, has also been noted for other kinds of Riemannian algorithms, cf. (Hosseini & Sra, 2020). We develop new techniques for inexact proximal methods in Riemannian manifolds and show that, with access to a (not necessarily accelerated) linearly convergent constrained subroutine for strongly g-convex smooth problems, we can solve a proximal subproblem inexactly to enough accuracy that it can be used in our accelerated outer loop, in the spirit of Euclidean algorithms like Catalyst (Lin et al., 2017). After building this machinery, we show that we are able to implement an inexact ball optimization oracle, cf. (Carmon et al., 2020), as an instance of our solution. Crucially, the diameter D of this ball depends only on R and the geometry, so in particular it is independent of the condition number of f. We can use the linearly convergent algorithm in (Criscitiello & Boumal, 2021) for the implementation of the prox subroutine, and we show that iterating the application of the ball optimization oracle leads to global accelerated convergence. Whether there are Riemannian analogs of Nesterov's algorithm enjoying similar rates is a question that, to the best of our knowledge, was first formulated in (Zhang & Sra, 2016). In particular, since Nesterov's AGD uses a proximal operator of a function's linearization, they ask whether there is a Riemannian analog of this operation that could be used to obtain accelerated rates in the Riemannian case.
We show that, instead, a proximal step with respect to the whole function can be approximated efficiently in Hadamard manifolds, and that it can be used along with an accelerated outer loop. To the best of our knowledge, previously known Riemannian proximal methods either obtain asymptotic analyses, assume exact proximal computation, or work with approximate proximal operators via inexactness conditions different from ours, and none of them shows how to implement the proximal operators or obtains accelerated proximal point methods, cf. Section 1.3.
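To give a minimal, concrete picture of the implicit (proximal) step underlying this approach, consider a one-dimensional Euclidean toy example; the function f(t) = t²/2 and the step size below are ours, chosen only for illustration:

```python
# Toy illustration (ours, not from the paper): on f(t) = t^2/2, the implicit
# (sub)gradient step y = x - lam * f'(y) -- equivalently, the proximal step
# y = argmin_u { f(u) + (1/(2*lam)) * (u - x)^2 } -- has the closed form
# y = x / (1 + lam).  It contracts toward the minimizer t = 0 for ANY lam > 0,
# whereas the explicit step x - lam * f'(x) overshoots once lam > 2/L (L = 1).
lam = 5.0                            # far larger than any stable explicit step
x = 1.0
y_implicit = x / (1.0 + lam)         # solves y = x - lam * y exactly
y_explicit = x - lam * x             # explicit step: overshoots past 0
assert abs(y_implicit - (x - lam * y_implicit)) < 1e-12   # fixed-point check
assert abs(y_implicit) < abs(x) < abs(y_explicit)
```

This tolerance of large step sizes is why an (inexact) implicit step can absorb the enlarged learning rate that geometric penalties force on the method.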

1.1. PRELIMINARIES

We provide definitions of the Riemannian geometry concepts that we use in this work. The interested reader can refer to (Petersen, 2006; Bacák, 2014) for an in-depth review of this topic, but for this work the following notions will be enough. A Riemannian manifold (ℳ, g) is a real C^∞ manifold ℳ equipped with a metric g, which is a smoothly varying, i.e., C^∞, inner product. For x ∈ ℳ, denote by T_x ℳ the tangent space of ℳ at x. For vectors v, w ∈ T_x ℳ, we denote the inner product of the metric by ⟨v, w⟩_x and the norm it induces by ‖v‖_x := √⟨v, v⟩_x. Most of the time the point x is clear from context, in which case we write ⟨v, w⟩ or ‖v‖. A geodesic of length ℓ is a curve γ : [0, ℓ] → ℳ of unit speed that is locally distance minimizing. A uniquely geodesic space is a space such that for every two points there is one and only one geodesic that joins them. In such a case, the exponential map Exp_x : T_x ℳ → ℳ and the inverse exponential map Log_x : ℳ → T_x ℳ are well defined for every pair of points, as follows: given x, y ∈ ℳ, v ∈ T_x ℳ, and a geodesic γ of length ‖v‖ such that γ(0) = x, γ(‖v‖) = y, and γ′(0) = v/‖v‖, we have Exp_x(v) = y and Log_x(y) = v. We denote by d(x, y) the distance between x and y, and note that it equals ‖Log_x(y)‖. The manifold ℳ comes with a natural parallel transport between vectors in different tangent spaces, which is formally defined from a way of identifying nearby tangent spaces known as the Levi-Civita connection ∇ (Levi-Civita, 1977). We use this parallel transport throughout this work. Given a 2-dimensional subspace V ⊆ T_x ℳ of the tangent space at a point x, the sectional curvature at x with respect to V is defined as the Gauss curvature of the surface Exp_x(V) at x. The Gauss curvature at a point x can in turn be defined as the product of the maximum and minimum curvatures of the curves resulting from intersecting the surface with planes that are normal to the surface at x.
A Hadamard manifold is a complete simply connected Riemannian manifold of non-positive sectional curvature, like the hyperbolic space or the space of n × n symmetric positive definite matrices with the metric ⟨X, Y⟩_A := Tr(A⁻¹XA⁻¹Y), where X, Y are in the tangent space at A. Hadamard manifolds are uniquely geodesic. Note that in a general manifold Exp_x(v) might not be defined for every v ∈ T_x ℳ, but in a Hadamard manifold of dimension n the exponential map at any point x is a global diffeomorphism between T_x ℳ ≅ ℝⁿ and the manifold, so the exponential map is defined everywhere. We now proceed to define the main properties that will be assumed in our model for the function to be minimized and for the feasible set 𝒳.

Definition 1.1 (Geodesic Convexity and Smoothness). Let f : 𝒩 ⊂ ℳ → ℝ be a differentiable function defined on an open set 𝒩 contained in a Riemannian manifold ℳ. Given L ≥ μ > 0, we say that f is L-smooth in a set 𝒳 ⊆ 𝒩 if for any two points x, y ∈ 𝒳 we have f(y) ≤ f(x) + ⟨∇f(x), Log_x(y)⟩ + (L/2) d(x, y)². Analogously, we say that f is μ-strongly g-convex in 𝒳 if for any two points x, y ∈ 𝒳 we have f(y) ≥ f(x) + ⟨∇f(x), Log_x(y)⟩ + (μ/2) d(x, y)². If this inequality is satisfied with μ = 0, we say the function is g-convex in 𝒳.

We present the following fact about the squared-distance function when one of its arguments is fixed. The constants ζ_D, δ_D below appear everywhere in Riemannian optimization because, among other things, Fact 1.2 yields Riemannian inequalities analogous to the equality in the Euclidean cosine law of a triangle, cf. Corollary C.3, and these inequalities have wide applicability in the analyses of Riemannian methods.

Fact 1.2 (Local information of the squared distance). Let ℳ be a Riemannian manifold of sectional curvature bounded in [κ_min, κ_max] that contains a uniquely geodesic g-convex set 𝒳 ⊂ ℳ of diameter D < ∞.
Then, given x, y ∈ 𝒳, the function Φ_x : ℳ → ℝ, y ↦ (1/2) d(x, y)² satisfies ∇Φ_x(y) = −Log_y(x) and δ_D ‖v‖² ≤ Hess Φ_x(y)[v, v] ≤ ζ_D ‖v‖², where

ζ_D := D √|κ_min| coth(D √|κ_min|) if κ_min ≤ 0, and ζ_D := 1 if κ_min > 0;
δ_D := 1 if κ_max ≤ 0, and δ_D := D √κ_max cot(D √κ_max) if κ_max > 0.

In particular, Φ_x is δ_D-strongly g-convex and ζ_D-smooth in 𝒳. See (Lezcano-Casado, 2020) for a proof.

1.2. NOTATION

Let ℳ be a uniquely geodesic n-dimensional Riemannian manifold. Given points x, y, z ∈ ℳ, we abuse notation and write y in non-ambiguous and well-defined contexts in which we should write Log_x(y). For example, for v ∈ T_x ℳ we have ⟨v, y − x⟩ = −⟨v, x − y⟩ = ⟨v, Log_x(y) − Log_x(x)⟩ = ⟨v, Log_x(y)⟩; ‖v − y‖ = ‖v − Log_x(y)‖; ‖z − y‖_x = ‖Log_x(z) − Log_x(y)‖; and ‖y − x‖_x = ‖Log_x(y)‖ = d(y, x). We denote by 𝒳 a compact, uniquely geodesic g-convex set of diameter D contained in an open set 𝒩 ⊂ ℳ, and we use I_𝒳 for the indicator function of 𝒳, which is 0 at points in 𝒳 and +∞ otherwise. For a vector v ∈ T_y ℳ, we use Γ^x_y(v) ∈ T_x ℳ to denote the parallel transport of v from T_y ℳ to T_x ℳ along the unique geodesic that connects y to x. We call f : 𝒩 ⊂ ℳ → ℝ a differentiable L-smooth g-convex function we want to optimize. We use ε to denote the approximation accuracy parameter, x_0 ∈ 𝒳 for the initial point of our algorithms, and R̄ := d(x_0, x̄*) for the initial distance to an arbitrary constrained minimizer x̄* ∈ arg min_{x∈𝒳} f(x). We use R for an upper bound on the initial distance d(x_0, x*) to an unconstrained minimizer x*, if it exists. The big-O notation Õ(·) omits log factors. Note that in the setting of Hadamard manifolds, the bounds on the sectional curvature satisfy κ_min ≤ κ_max ≤ 0. Hence, for notational convenience, we define ζ̄ := ζ_D = D √|κ_min| coth(D √|κ_min|) ≥ 1 and δ̄ := 1, and similarly ζ := ζ_R and δ := δ_R = 1.
If v ∈ T_x ℳ, we use Π_{B(0,r)}(v) ∈ T_x ℳ for the projection of v onto the closed ball with center 0 and radius r.
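The geometric constants of Fact 1.2 are elementary to evaluate numerically. A minimal sketch (function names are ours, not the paper's):

```python
# Hedged sketch (ours): the constants zeta_D and delta_D of Fact 1.2, for
# sectional curvature bounded in [kappa_min, kappa_max] and a g-convex set of
# diameter D.  In a Hadamard manifold (kappa_max <= 0), delta_D = 1 and
# 1 <= zeta_D <= 1 + D * sqrt(|kappa_min|).
import math

def zeta(D, kappa_min):
    """zeta_D: smoothness constant of (1/2) d(x, .)^2.  Equals 1 if kappa_min >= 0."""
    if kappa_min >= 0:
        return 1.0
    s = D * math.sqrt(abs(kappa_min))
    return s / math.tanh(s)                 # s * coth(s)

def delta(D, kappa_max):
    """delta_D: strong-convexity constant of (1/2) d(x, .)^2.  Equals 1 if kappa_max <= 0."""
    if kappa_max <= 0:
        return 1.0
    s = D * math.sqrt(kappa_max)
    return s / math.tan(s)                  # s * cot(s); requires s < pi

assert delta(2.0, -1.0) == 1.0              # Hadamard case
assert 1.0 <= zeta(2.0, -1.0) <= 1.0 + 2.0  # zeta_D <= 1 + D*sqrt(|kappa_min|)
assert abs(zeta(1e-9, -1.0) - 1.0) < 1e-6   # zeta_D -> 1 as curvature effects vanish
```

The last assertion reflects the remark made later in Section 1.3: as the curvature's effect vanishes, the geometric penalty disappears.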

1.3. OUR RESULTS AND COMPARISONS WITH RELATED WORK

In this work, we optimize functions defined over Hadamard manifolds ℳ of finite dimension n and of sectional curvature bounded in [κ_min, κ_max]. As do all previous related works discussed in the sequel, we assume that we can compute the exponential and inverse exponential maps and the parallel transport of vectors for our manifold. The differentiable function f to be optimized is defined over an open set 𝒩 ⊂ ℳ that contains a compact g-convex set 𝒳 of finite diameter D. Our function f is L-smooth and g-convex (or μ-strongly g-convex) in 𝒳, and we have access to it via a gradient oracle that can be queried at points in 𝒳. For this setting, we show in Theorem 2.2 that, with access to a (possibly unaccelerated) linearly convergent subroutine for strongly g-convex smooth problems in 𝒳, the algorithms we propose find a point y_T ∈ 𝒳 such that f(y_T) − min_{x∈𝒳} f(x) ≤ ε after calling the gradient oracle and the subroutine Õ(ζ̄ √(L R̄²/ε)) times in the g-convex case and Õ(ζ̄ √(L/μ) log(μ R̄²/ε)) times in the μ-strongly g-convex case, where R̄ := d(x_0, x̄*) and x_0 ∈ 𝒳 is an initial point. Then, in Theorem 2.4, we instantiate our algorithm with the method in (Criscitiello & Boumal, 2021) as subroutine, boost the convergence by implementing and sequentially applying an inexact ball optimization oracle, and obtain the rates Õ(ζ² √(ζ + L R²/ε)) and Õ(ζ² √(L/μ) log(μ R²/ε)), where R is a bound on the initial distance d(x_0, x*) to an unconstrained minimizer x*. In sum, the algorithms enjoy the same rates as AGD in the Euclidean space up to a factor of ζ² = R² |κ_min| coth²(R √|κ_min|) ≤ (1 + R √|κ_min|)² (our geometric penalty) and up to universal constants and log factors. Note that as the minimum curvature κ_min approaches 0, we have ζ → 1. We emphasize that our Algorithm 1 only needs to query the gradient of f at points in 𝒳, and the L-smoothness and μ-strong g-convexity of f only need to hold in 𝒳.
This is relevant because in Riemannian manifolds the condition number L/μ in a set can increase with the size of the set, cf. (Martínez-Rubio, 2020, Proposition 27). Intuitively, although there are twice-differentiable functions defined over the Euclidean space whose Hessian is constant everywhere, in other Riemannian cases the metric may preclude such a global condition, and the larger the set is, the larger the minimum possible condition number becomes. Compare this, for instance, with the bounds on the eigenvalues of the Hessian of the squared-distance function in Fact 1.2, which are tight for spaces of constant curvature (Lezcano-Casado, 2020).

Now we proceed to compare our results with previous works. We have summarized most of the following discussion in Table 1, and we include Nesterov's AGD in the table for comparison purposes. There are some works on Riemannian acceleration that focus on empirical evaluation or that work under strong assumptions (Liu et al., 2017; Alimisis et al., 2019; Huang & Wei, 2019a; Alimisis et al., 2020; Lin et al., 2020); see for instance (Martínez-Rubio, 2020) for a discussion of these works. We focus the discussion on the most related works with guarantees. (Zhang & Sra, 2018) obtain an algorithm that, up to constants, achieves the same rates as AGD in the Euclidean space for L-smooth and μ-strongly g-convex functions, but only locally, namely when the initial point starts in a small neighborhood N of the minimizer x*: a ball of radius O((μ/L)^{3/4}) around it. (Ahn & Sra, 2020) generalize the previous algorithm and, by using similar ideas as in (Zhang & Sra, 2018) for estimating a lower bound on f, adapt the algorithm to work globally, proving that it eventually decreases the objective as fast as AGD. However, as (Martínez-Rubio, 2020) noted, it takes as many iterations as Riemannian gradient descent (RGD) needs to reach the neighborhood of the previous algorithm.
The latter work also noted that in fact RGD and the algorithm in (Zhang & Sra, 2018) can be run in parallel and combined to obtain the same convergence rates as in (Ahn & Sra, 2020), which suggests that for this technique full acceleration with the rates of AGD only happens over the small neighborhood N of (Zhang & Sra, 2018). Note, however, that (Ahn & Sra, 2020) show that their algorithm decreases the function value faster than RGD, but this is not quantified. (Jin & Sra, 2021) developed a different framework, arising from (Ahn & Sra, 2020), with the same guarantees for accelerated first-order methods; we do not feature it in the table. (Criscitiello & Boumal, 2021) showed, under mild assumptions, that in a ball of center x ∈ ℳ and radius O((μ/L)^{1/2}) containing x*, the pullback function f ∘ Exp_x : T_x ℳ → ℝ is Euclidean, strongly convex, and smooth with condition number O(L/μ), so AGD yields local acceleration as well. In short, acceleration is possible in a small neighborhood because there the manifold is almost Euclidean and the geometric deformations are small in comparison to the curvature of the objective. These techniques fail for the g-convex case, since then the neighborhood becomes a point (μ/L = 0). Finding fully accelerated algorithms that are global presents a harder challenge; by a fully accelerated algorithm we mean one with rates with the same dependence as AGD on L, ε, and, if it applies, on μ. (Martínez-Rubio, 2020) provided such algorithms for g-convex functions, strongly convex or not, defined over manifolds of constant sectional curvature and constrained to a ball of radius R. The convergence rates initially had large constants with respect to R but were later improved, cf. Table 1. Kim & Yang (2022) designed global algorithms with the same rates as AGD up to universal constants and a factor of ζ̄, their geometric penalty.
However, they need to assume that the iterates of their algorithm remain in their feasible set 𝒳, and they point out the necessity of removing such an assumption, which they leave as an open question. Our work solves this question for the case of Hadamard manifolds. In their technique, they show that they can use the structure of the accelerated scheme to move lower-bound estimations on f(x*) from one particular tangent space to another without incurring extra errors, when the right Lyapunov function is used. By moving lower bounds here we mean finding suitable lower bounds that are simple (a quadratic in their case) when pulled back to one tangent space, provided we start with a similar bound that is simple when pulled back to another tangent space.

Table 1: Convergence rates of related works with provable guarantees for smooth problems over uniquely geodesic manifolds. Column K? refers to the supported values of the sectional curvature, and G? to whether the algorithm is global (any initial distance to a minimizer is allowed); here L and L′ mean the algorithms are local and require initial distance O((L/μ)^{−3/4}) and O((L/μ)^{−1/2}), respectively. Column F? refers to whether there is full acceleration, meaning dependence on L, μ, and ε like AGD up to possibly log factors. Column C? refers to whether the method can enforce some constraints. All methods require their iterates to be in some specified compact set, but the works without this capability just assume the iterates will remain within the constraints.

Lower bounds. In this paragraph, we omit constants depending on the curvature bounds in the big-O notations for simplicity. (Hamilton & Moitra, 2021) proved an optimization lower bound showing that acceleration in Riemannian manifolds is harder than in the Euclidean space. (Criscitiello & Boumal, 2021) largely generalized their results.
They essentially show that, for a large family of Hadamard manifolds, there is a function that is smooth and strongly g-convex in a ball of radius R containing the minimizer x*, and for which finding a point that is R/5-close to x* requires Ω̃(R) calls to the gradient oracle. Note that these results do not preclude the existence of a fully accelerated algorithm whose rate is, for instance, Õ(R) plus the AGD rate. A similar hardness statement is provided for smooth and only g-convex functions. Also, reductions as in (Martínez-Rubio, 2020) show that this hardness is present in that case as well.

Handling constraints to bound geometric penalties. In our algorithm, and in all other known fully accelerated algorithms, learning rates depend on the diameter of the feasible set. This is natural: estimation errors due to geometric deformations depend on the diameter via the constants ζ_D, δ_D, the cosine-law inequalities of Corollary C.3, or other analogous inequalities, and the algorithms take these errors into account. All other previous works are not able to deal with any constraints and hence simply assume that the iterates of their algorithms stay within one such specified set, except for (Martínez-Rubio, 2020) and (Criscitiello & Boumal, 2021), which enforce a ball constraint, as we explained above. However, these two works have their applicability limited to spaces of constant curvature and to local optimization, respectively. Note that even if one could show that, for a given choice of learning rate, convergence implies that the iterates remain in some compact set, one could not conclude that the assumption these works make is satisfied: the learning rates depend on the diameter of the set, and the diameter of the set would in turn depend on the learning rates.
In contrast, in this work we design a general accelerated framework, and an instance of it, that keep the iterates bounded, effectively bounding geometric penalties without resorting to any extra assumptions, thereby solving the open question in (Kim & Yang, 2022).

Riemannian proximal methods. There have been some works that study proximal methods in Riemannian manifolds, but most of them focus on asymptotic results or assume the proximal operator can be computed exactly (Wang et al., 2015; Bento et al., 2017; 2016; Khammahawong et al., 2021; Chang et al., 2021). The rest of these works study proximal point methods under inexact versions of the proximal operator different from ours, and they do not show how to implement their inexact versions in applications such as our case of smooth and g-convex optimization; in contrast, we implement the inexact proximal operator with a first-order method. (Ahmadi & Khatibzadeh, 2014) provide a convergence analysis of an inexact proximal point method, but when applied to optimization they assume the computation of the proximal operator is exact. (Tang & Huang, 2014) use a different inexactness condition and prove linear convergence under a growth condition on f. (Wang et al., 2016b) obtain linear convergence of an inexact proximal point method under a different growth assumption on f and under an absolute error condition on the proximal function.
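The algorithms in the next section access the manifold only through the exponential map, its inverse, parallel transport, and the distance, which we assumed computable above. For concreteness, here is a hedged sketch of these primitives on the SPD example from Section 1.1 (the affine-invariant metric ⟨X, Y⟩_A = Tr(A⁻¹XA⁻¹Y)); the function names and the closed forms via matrix exp/log are standard, but the code is ours, not the paper's:

```python
# Hedged sketch (ours): Exp, Log, and distance on the manifold of SPD matrices
# with the affine-invariant metric, a Hadamard manifold (cf. Section 1.1).
import numpy as np
from scipy.linalg import expm, logm

def _sym_pow(A, p):
    """A^p for a symmetric positive definite A, via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * w**p) @ V.T

def spd_exp(A, X):
    """Exp_A(X): geodesic from A with initial velocity X (symmetric)."""
    Ah, Aih = _sym_pow(A, 0.5), _sym_pow(A, -0.5)
    return Ah @ expm(Aih @ X @ Aih) @ Ah

def spd_log(A, B):
    """Log_A(B): tangent vector at A pointing toward B."""
    Ah, Aih = _sym_pow(A, 0.5), _sym_pow(A, -0.5)
    return Ah @ np.real(logm(Aih @ B @ Aih)) @ Ah

def spd_dist(A, B):
    """d(A, B) = ||logm(A^{-1/2} B A^{-1/2})||_F = ||Log_A(B)||_A."""
    Aih = _sym_pow(A, -0.5)
    return np.linalg.norm(np.real(logm(Aih @ B @ Aih)), 'fro')

A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, -0.3], [-0.3, 3.0]])
assert np.allclose(spd_exp(A, spd_log(A, B)), B)      # Exp and Log are inverse
X = spd_log(A, B)
Ai = np.linalg.inv(A)
norm_A = np.sqrt(np.trace(Ai @ X @ Ai @ X))           # ||X||_A from the metric
assert np.isclose(norm_A, spd_dist(A, B))             # d(A, B) = ||Log_A(B)||_A
```

The two assertions check the identities stated in Section 1.1: Exp and Log are mutually inverse, and the distance equals the norm of the inverse exponential.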

2. ALGORITHMIC FRAMEWORK AND PSEUDOCODE

In this section, we present our Riemannian accelerated algorithm for constrained g-convex optimization, Riemacon. This is a general framework that we later instantiate to provide a full algorithm. Recall our abuse of notation for points p ∈ ℳ to mean Log_q(p) in contexts in which one should place a vector in T_q ℳ, and note that in our algorithm x_k and y_k are points in ℳ whereas z_k^{x_k} ∈ T_{x_k} ℳ and z_k^{y_k}, z̄_k^{y_k} ∈ T_{y_k} ℳ.

Algorithm 1 Riemacon: Riemannian Acceleration for Constrained g-Convex Optimization
Input: Feasible set 𝒳. Initial point x_0 ∈ 𝒳 ⊂ 𝒩. Differentiable function f : 𝒩 ⊂ ℳ → ℝ, for a Hadamard manifold ℳ, that is L-smooth and g-convex in 𝒳. Optionally: final iteration T or accuracy ε. If ε is provided, compute the corresponding T, cf. Theorem 2.2.
Parameters:
• Geometric penalty: ξ := 4ζ_{2D} − 3 ≤ 8ζ̄ − 3 = O(ζ̄).
• Implicit gradient descent learning rate: λ := ζ_{2D}/L.
• Mirror descent learning rates: η_k := a_k/ξ.
• Proportionality constant in the proximal subproblem accuracies: Δ_k := 1/(k+1)².
Definition (computing this value is not needed):
• Proximal accuracies: σ_k := Δ_k d(x_k, y_k*)²/(78λ), where y_k* := arg min_{y∈𝒳} {f(y) + (1/2λ) d(x_k, y)²}.
1: y_0 ← x_0; A_0 ← 200λξ
2: z_0^{x_0} ← 0 ∈ T_{x_0} ℳ; z̄_0^{y_0} ← z_0^{y_0} ← 0 ∈ T_{y_0} ℳ
3: for k = 1 to T do
4:   a_k ← 2λ(k + 32ξ)/5
5:   A_k ← a_k/ξ + A_{k−1} = Σ_{i=1}^k a_i/ξ + A_0 = λ(k(k+1+64ξ)/(5ξ) + 200ξ)
6:   x_k ← Exp_{y_{k−1}}((a_k/(A_{k−1}+a_k)) z̄_{k−1}^{y_{k−1}} + (A_{k−1}/(A_{k−1}+a_k)) y_{k−1}) = Exp_{y_{k−1}}((a_k/(A_{k−1}+a_k)) z̄_{k−1}^{y_{k−1}})   ◇ Coupling
7:   z_{k−1}^{x_k} ← Γ^{x_k}_{y_{k−1}}(z̄_{k−1}^{y_{k−1}}) + Log_{x_k}(y_{k−1}) = Log_{x_k}(Exp_{y_{k−1}}(z̄_{k−1}^{y_{k−1}}))
8:   y_k ← σ_k-minimizer of the proximal problem min_{y∈𝒳} {f(y) + (1/2λ) d(x_k, y)²}
9:   v_{x_k} ← −Log_{x_k}(y_k)/λ   ◇ Approximate subgradient
10:  z_k^{x_k} ← z_{k−1}^{x_k} − η_k v_{x_k}   ◇ Mirror descent step
11:  z_k^{y_k} ← Γ^{y_k}_{x_k}(z_k^{x_k}) + Log_{y_k}(x_k)   ◇ Moving the dual point to T_{y_k} ℳ
12:  z̄_k^{y_k} ← Π_{B(0,D)}(z_k^{y_k}) ∈ T_{y_k} ℳ   ◇ Easy projection done so the dual point is not very far
13: end for
14: return y_T

We start with an interpretation of our algorithm that helps in understanding its high-level ideas. The following is intended as a qualitative explanation; we refer to the pseudocode and the supplementary material for the exact descriptions and analysis. Euclidean accelerated algorithms can be interpreted, cf. (Allen Zhu & Orecchia, 2017), as a combination of a gradient descent (GD) algorithm and an online learning algorithm whose losses are the affine lower bounds f(x_k) + ⟨∇f(x_k), · − x_k⟩ obtained on f(·) by applying convexity at points x_k. That is, the latter builds a lower-bound estimation on f. By selecting the next query to the gradient oracle as a cleverly picked convex combination of the predictions given by these two algorithms, one can show that the instantaneous regret of the online learning algorithm can be compensated by the local progress GD makes, which leads to accelerated convergence. In Riemannian optimization, there are two main obstacles.
Firstly, the first-order approximations of f at points x_k yield functions that are affine only with respect to their respective tangent spaces T_{x_k} ℳ, so combining these lower bounds, which are simple only in their own tangent spaces, makes obtaining good global estimations non-trivial. Secondly, when one obtains such global estimations, one naturally incurs an instantaneous regret that is worse, by a multiplicative factor, than is usual in Euclidean acceleration. This factor is a geometric constant depending on the diameter D of a set 𝒳 in which the iterates and a (possibly constrained) minimizer lie. As a consequence, the learning rate of GD would need to be multiplicatively increased by such a constant with respect to the one of the online learning algorithm in order for the regret to still be compensated by the local progress of GD (and the rates worsen by this constant). But if we fix some 𝒳 of finite diameter, because GD's learning rate is now larger, it is not clear how to keep the iterates in 𝒳. And if the iterates do not stay in one such set 𝒳, then our geometric penalties could grow arbitrarily. We find the answer in implicit methods. An implicit Euclidean (sub)gradient descent step is one that computes, from a point x_k ∈ 𝒳, another point y_k* = x_k − λ v_k ∈ 𝒳, where v_k ∈ ∂(f + I_𝒳)(y_k*) is a subgradient of f + I_𝒳 at y_k*. Intuitively, if we could implement a Riemannian version of an implicit GD step, then it should be possible to still compensate the regret of the other algorithm and keep all the iterates in the set 𝒳. Computing such an implicit step exactly is computationally hard in general, but we show that approximating the proximal objective h_k(y) := f(y) + (1/2λ) d(x_k, y)² with enough accuracy yields an approximate subgradient that can be used to obtain an accelerated algorithm as well.
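A quick sanity check (our computation, combining Fact 1.2 with the choice λ = ζ_{2D}/L from Algorithm 1) of why the subproblem h_k is well-conditioned independently of the condition number of f:

```latex
\begin{align*}
h_k(y) &= f(y) + \tfrac{1}{2\lambda}\, d(x_k, y)^2, \qquad \lambda = \zeta_{2D}/L,\\
\text{smoothness:}\quad & L + \tfrac{1}{\lambda}\,\zeta_{2D} = L + L = 2L
\qquad \big(\text{Fact 1.2: } \tfrac{1}{2} d(x_k,\cdot)^2 \text{ is } \zeta_{2D}\text{-smooth}\big),\\
\text{strong g-convexity:}\quad & 0 + \tfrac{1}{\lambda}\,\delta_{2D} = \tfrac{L}{\zeta_{2D}}
\qquad \big(\delta_{2D} = 1 \text{ on Hadamard manifolds}\big),\\
\text{condition number:}\quad & \kappa(h_k) \le \frac{2L}{L/\zeta_{2D}} = 2\zeta_{2D},
\end{align*}
```

which depends on the geometry only, through D and κ_min.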
In particular, we provide an accelerated scheme for which we show that the error incurred by the approximation of the subgradient can be bounded by terms we can control, cf. Lemma A.2: a small term that appears in our Lyapunov function, and a term proportional to the squared norm of the approximated subgradient, which only increases the final convergence rates by a constant. This proximal approach works by exploiting the fact that the Riemannian Moreau envelope is convex in Hadamard manifolds (Azagra & Ferrera, 2005) and that the subproblem h_k, defined with our λ = ζ_{2D}/L, is strongly g-convex and smooth with a condition number that depends only on the geometry. For this reason, a local algorithm like the one in (Criscitiello & Boumal, 2021) can be implemented in balls whose radius is independent of the condition number of f. Besides these steps, we use a coupling of the approximate implicit RGD and of a mirror descent (MD) algorithm, along with a technique in (Kim & Yang, 2022) for moving dual points to the right tangent spaces without incurring extra geometric penalties, which we adapt to work with dual projections, cf. Lemma A.3. Importantly, the MD algorithm keeps the dual point close to the set 𝒳 by using the projection in Line 12, which implies that the point x_k is close to 𝒳 as well, and this is crucial to keep geometric penalties low. This MD approach is a mix between follow-the-regularized-leader algorithms, which do not project the dual variable, and pure mirror descent algorithms, which always project the dual variable. In the analysis, we note that partial projection also works: defining a new dual point that is closer to all of the points in the feasible set, without being a full projection, leads to the same guarantees.
Because we use the mirror descent lemma over 𝑇_{𝑦_𝑘}ℳ, what we described translates to the following: we can project the dual point 𝑧_𝑘^{𝑦_𝑘} onto a ball in 𝑇_{𝑦_𝑘}ℳ that contains the pulled-back set Log_{𝑦_𝑘}(𝒳), and by means of this trick we can keep the iterates 𝑥_𝑘 close to 𝒳. At the same time, the point for which we prove guarantees, namely 𝑦_𝑘, is always in 𝒳. Finally, we instantiate our subroutine with the algorithm in (Criscitiello & Boumal, 2021), in balls of radius independent of the condition number of 𝑓, and show in Theorem 2.4 that iterating this approximate implementation of a ball optimization oracle yields convergence at a globally accelerated rate. We note that (Zhang & Sra, 2016) also provided a claimed linearly convergent algorithm for constrained strongly g-convex smooth problems, so in principle it could be used for our subroutine. Unfortunately, we noticed that its proof is flawed when the optimization is constrained. We leave the proofs of most of our results to the supplementary material and state our main theorems below. Using the insights explained above, we show the following inequality on 𝜓_𝑘, defined below, which will be used as a Lyapunov function to prove the convergence rates of Algorithm 1.

Proposition 2.1. [↓] Using the notation of Algorithm 1, let 𝜓_𝑘 := 𝐴_𝑘(𝑓(𝑦_𝑘) − 𝑓(𝑥*)) + ½‖𝑧_𝑘^{𝑦_𝑘} − Log_{𝑦_𝑘}(𝑥*)‖²_{𝑦_𝑘} + ((𝜉 − 1)/2)‖𝑧_𝑘^{𝑦_𝑘}‖²_{𝑦_𝑘}. Then, for all 𝑘 ≥ 1, we have (1 − Δ_𝑘)𝜓_𝑘 ≤ 𝜓_{𝑘−1}.

Finally, we can state our theorem for the optimization of 𝐿-smooth and g-convex functions.

Theorem 2.2. [↓] Let ℳ be a finite-dimensional Hadamard manifold of bounded sectional curvature, let 𝑓 : 𝒩 ⊂ ℳ → ℝ be an 𝐿-smooth and g-convex differentiable function in a compact g-convex set 𝒳 ⊂ 𝒩 of diameter 𝐷, let 𝑥* ∈ arg min_{𝑥∈𝒳} 𝑓(𝑥), and let 𝑅 := 𝑑(𝑥_0, 𝑥*). For any 𝜀 > 0, Algorithm 1 yields an 𝜀-minimizer 𝑦_𝑇 ∈ 𝒳 after 𝑇 = 𝑂(𝜁√(𝐿𝑅²/𝜀)) iterations.
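To sketch how Proposition 2.1 yields the rate in Theorem 2.2 (the constants here are indicative only, and we assume, as is standard in such proofs, that 𝑧_0 = 0, 𝐴_0 = 0, and that the weights satisfy 𝐴_𝑇 = Ω(𝑇²/(𝜁𝐿))):

```latex
% Unroll (1-\Delta_k)\psi_k \le \psi_{k-1} for k = 1,\dots,T:
\psi_T \prod_{k=1}^{T} (1-\Delta_k) \le \psi_0 .
% Lower-bound the left-hand side and evaluate the right-hand side:
A_T\,\bigl(f(y_T) - f(x^*)\bigr) \le \psi_T , \qquad
\psi_0 = \tfrac{1}{2}\bigl\|\operatorname{Log}_{y_0}(x^*)\bigr\|_{y_0}^2
       = \tfrac{R^2}{2} .
% Hence, if \prod_{k}(1-\Delta_k) = \Omega(1) and A_T = \Omega(T^2/(\zeta L)):
f(y_T) - f(x^*)
  \le \frac{R^2}{2\, A_T \prod_{k=1}^{T}(1-\Delta_k)}
  = O\!\left(\frac{\zeta L R^2}{T^2}\right),
% i.e., T = O(\zeta\sqrt{L R^2/\varepsilon}) iterations suffice for an
% \varepsilon-minimizer, matching Theorem 2.2.
```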
If the function is 𝜇-strongly g-convex then, via a sequence of restarts, we converge in 𝑂(𝜁√(𝐿/𝜇) log(𝜇𝑅²/𝜀)) iterations. We note a straightforward corollary of our results: if we can compute the exact Riemannian proximal point operator and use it as the implicit gradient descent step in Line 8 of Algorithm 1, then the method is an accelerated proximal point method. Such a Riemannian algorithm was previously unknown in the literature as well. Finally, we instantiate Algorithm 1 to implement approximate ball optimization oracles in an accelerated way. We show that applying these oracles sequentially leads to global accelerated convergence. Moreover, we show that the iterates do not get farther than 2𝑅 from 𝑥*, which ultimately makes the geometric penalty a function of 𝜁 and not of the condition number of 𝑓. For the subroutine in Line 8 of Algorithm 1, we use the algorithm in (Criscitiello & Boumal, 2021, Section 6), and for that we require the following.

Assumption 2.3. Let R be the curvature tensor of a Riemannian manifold ℳ. Its covariant derivative is ∇R = 0.

We note that locally symmetric manifolds, such as SO(𝑛), the SPD matrix manifold, the Grassmannian manifold, and manifolds of constant sectional curvature, are all manifolds with ∇R = 0. We argue that this assumption is mild, since in particular these manifolds cover all of the applications in Section 1.

Theorem 2.4. [↓] Let ℳ be a finite-dimensional Hadamard manifold of bounded sectional curvature satisfying Assumption 2.3. Let 𝑓 : 𝒩 ⊂ ℳ → ℝ be an 𝐿-smooth and 𝜇-strongly g-convex differentiable function in B(𝑥*, 3𝑅), where 𝑥* is its global minimizer and 𝑅 ≥ 𝑑(𝑥_0, 𝑥*) for an initial point 𝑥_0. For any 𝜀 > 0, Algorithm 2 yields an 𝜀-minimizer after Õ(𝜁²√(𝐿/𝜇) log(𝐿𝑅²/𝜀)) calls to the gradient oracle of 𝑓.
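The restart mechanism behind the strongly g-convex rate can be sketched in the Euclidean setting (a toy stand-in of ours: Nesterov's AGD plays the role of Algorithm 1, and the constant 4 in the stage length is illustrative). A method with rate 𝑂(𝐿𝑅²/𝑇²), run for 𝑇 = 𝑂(√(𝐿/𝜇)) iterations, halves the distance to the minimizer under 𝜇-strong convexity, so restarting it 𝑂(log(1/𝜀)) times yields linear convergence:

```python
import numpy as np

def agd(grad, x0, L, T):
    """Nesterov's AGD: f(x_T) - f* = O(L ||x0 - x*||^2 / T^2)."""
    x_prev = x0.copy()
    x = x0.copy()
    y = x0.copy()
    for t in range(1, T + 1):
        x_prev, x = x, y - grad(y) / L
        y = x + (t - 1.0) / (t + 2.0) * (x - x_prev)
    return x

def restarted_agd(grad, x0, L, mu, stages):
    """Run AGD for O(sqrt(L/mu)) steps per stage; under mu-strong convexity
    each stage (at least) halves the distance to the minimizer."""
    T = int(np.ceil(4.0 * np.sqrt(L / mu)))
    x = x0
    for _ in range(stages):
        x = agd(grad, x, L, T)
    return x

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
H = M @ M.T + np.eye(6)                  # Hessian with mu >= 1
L = np.linalg.norm(H, 2)
x_star = rng.standard_normal(6)
grad = lambda x: H @ (x - x_star)        # f(x) = 0.5 (x - x*)^T H (x - x*)
x = restarted_agd(grad, np.zeros(6), L, mu=1.0, stages=20)
print(np.linalg.norm(x - x_star))        # small: distance halves per stage
```

With stage length 𝑇 = 4√(𝐿/𝜇), the gap after a stage is at most 2𝐿𝑅²/𝑇² ≤ 𝜇𝑅²/8, and strong convexity converts this into a halved distance, which is the same bookkeeping used in our restart argument.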
By using regularization, this algorithm 𝜀-minimizes the g-convex case (𝜇 = 0) after Õ(𝜁²√(𝐿𝑅²/𝜀)) calls to the gradient oracle, obtained by applying the theorem above with regularization parameter 𝜇 = Θ(𝜀/𝑅²). Here RiemaconSC denotes the strongly convex version of Algorithm 1 in Theorem 2.2 (cf. its proof).
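The regularization reduction can be checked in a small Euclidean example (hypothetical data; a closed-form solve stands in for running the strongly convex algorithm): adding the term (𝛿/2)‖𝑥 − 𝑥_0‖² with 𝛿 = 𝜀/𝑅² makes the problem 𝛿-strongly convex while shifting the optimal value by at most 𝜀/2:

```python
import numpy as np

# Euclidean toy of the regularization reduction (all constants hypothetical).
rng = np.random.default_rng(4)
c = rng.standard_normal(5)
f = lambda x: 0.5 * (c @ x) ** 2      # convex, but only rank-1 curvature
x0 = rng.standard_normal(5)
x_star = np.zeros(5)                  # one minimizer of f, with f(x_star) = 0
R = np.linalg.norm(x0 - x_star)
eps = 1e-2
delta = eps / R ** 2                  # regularization strength

# F(x) = f(x) + (delta/2) ||x - x0||^2 is a strongly convex quadratic here,
# so its minimizer has a closed form: (c c^T + delta I) x = delta x0.
x_hat = np.linalg.solve(np.outer(c, c) + delta * np.eye(5), delta * x0)

# F(x_hat) <= F(x_star) = f(x_star) + (delta/2) R^2 = eps/2,
# hence f(x_hat) <= eps/2: the surrogate's minimizer eps-minimizes f.
print(f(x_hat) <= eps / 2)  # True
```

Combining this shift of at most 𝜀/2 with the √(𝐿/𝜇) = √(𝐿𝑅²/𝜀) dependence of the strongly convex rate recovers the stated Õ(𝜁²√(𝐿𝑅²/𝜀)) complexity.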

3. CONCLUSION AND FUTURE DIRECTIONS

In this work, we pursued an approach that, by designing and making use of inexact Riemannian proximal methods, yields accelerated optimization algorithms. Consequently, we were able to dispense with an undesirable assumption that most previous methods required and whose satisfiability is unclear: that the iterates stay in a certain pre-specified geodesically-convex set without being enforced to remain in it. A future direction of research is the study of whether there are algorithms like ours that incur even lower geometric penalties, or that do not incur log(1/𝜀) factors. Another interesting direction is the study of generalizations of our approach to more general manifolds, namely the full Hadamard case, and manifolds of non-negative or even of bounded sectional curvature.



Note that the original method in (Nesterov, 1983) needed to query the gradient of the function outside of the feasible set; this was later improved to only require queries at feasible points (Nesterov, 2005), as in our work, hence our choice of citation in the table.

Riemacon rhymes with "rima con" in Spanish.



Algorithm 2
1: if … (46𝑅|𝜅_min|𝜁_{2𝑅})⁻¹ then return RiemaconSC(B(𝑥_0, 𝑅), 𝑥_0, 𝑓, 𝜀)
2: Compute 𝐷 such that 𝐷 = (46𝑅|𝜅_min|𝜁_𝐷)⁻¹. Alternatively, set 𝐷 ← (70𝑅|𝜅_min|)⁻¹.
3: 𝑇 ← ⌈(4𝑅/𝐷) ln(𝐿𝑅²/𝜀)⌉;  𝜀′ ← min{𝐷𝜀/(8𝑅), 𝜇𝑅²/(2𝑇²)}
4: for 𝑘 = 1 to 𝑇 do
5:     𝒳_𝑘 ← B(𝑥_{𝑘−1}, 𝐷/2)
6:     𝑥_𝑘 ← RiemaconSC(𝒳_𝑘, 𝑥_{𝑘−1}, 𝑓, 𝜀′)   ◇ (Criscitiello & Boumal, 2021) as subroutine
7:     ◇ RiemaconSC is the strongly convex version of Algorithm 1 in Theorem 2.2 (cf. its proof)
8: end for
9: return 𝑥_𝑇
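The mechanics of the outer loop can be mimicked in a Euclidean toy (a sketch of ours, not the paper's algorithm: a projected-gradient ball solver stands in for RiemaconSC, and the constants 𝐷 and 𝑇 are arbitrary). Each stage minimizes 𝑓 over the ball B(𝑥_{𝑘−1}, 𝐷/2), so the centers advance a fixed distance toward the minimizer until some ball captures it:

```python
import numpy as np

rng = np.random.default_rng(3)
x_star = rng.standard_normal(5)
grad = lambda x: x - x_star             # f(x) = 0.5 ||x - x_star||^2

def ball_solver(center, radius, x0, iters=500):
    """Projected gradient descent on B(center, radius): a toy stand-in
    for the constrained strongly convex subroutine."""
    x = x0.copy()
    for _ in range(iters):
        x = x - 0.5 * grad(x)           # gradient step (L = 1 here)
        d = x - center
        n = np.linalg.norm(d)
        if n > radius:                  # project back onto the ball
            x = center + d * (radius / n)
    return x

D = 0.4
T = 40                                  # plays the role of (4R/D) ln(.) above
x = np.zeros(5)
for _ in range(T):
    x = ball_solver(x, D / 2, x)        # X_k = B(x_{k-1}, D/2)
print(np.linalg.norm(x - x_star))       # ~0 after roughly 2R/D stages
```

Because each subproblem is solved over a ball of fixed radius, the subroutine's difficulty is governed by the geometry of the ball rather than by the conditioning of 𝑓, which is the reason the geometric penalty in Theorem 2.4 depends on 𝜁 and not on 𝐿/𝜇.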

