ON THE LOWER BOUND OF MINIMIZING POLYAK-ŁOJASIEWICZ FUNCTIONS

Anonymous authors
Paper under double-blind review

Abstract

The Polyak-Łojasiewicz (PL) condition (Polyak, 1963) is weaker than strong convexity but suffices to ensure global convergence of the Gradient Descent algorithm. In this paper, we study the lower bound for algorithms that use first-order oracles to find an approximate optimal solution. We show that any first-order algorithm requires at least $\Omega\big(\frac{L}{\mu}\log\frac{1}{\varepsilon}\big)$ gradient costs to find an $\varepsilon$-approximate optimal solution for a general $L$-smooth function with $\mu$-PL constant. This result demonstrates the optimality of the Gradient Descent algorithm for minimizing smooth PL functions, in the sense that there exists a "hard" PL function on which no first-order algorithm can be faster than Gradient Descent, up to a numerical constant. In contrast, it is well known that the momentum technique, e.g. (Nesterov, 2003, chap. 2), provably accelerates Gradient Descent to $\mathcal{O}\big(\sqrt{\frac{L}{\mu}}\log\frac{1}{\varepsilon}\big)$ gradient costs for functions that are $L$-smooth and $\mu$-strongly convex. Therefore, our result distinguishes the hardness of minimizing a smooth PL function from that of a smooth strongly convex function, as the complexity of the former cannot be improved by any polynomial order in general.

1. INTRODUCTION

We consider the problem $\min_{x \in \mathbb{R}^d} f(x)$, where the function $f$ is $L$-smooth and satisfies the Polyak-Łojasiewicz condition. A function $f$ is said to satisfy the Polyak-Łojasiewicz condition if (2) holds for some $\mu > 0$:
$$\|\nabla f(x)\|^2 \ge 2\mu \Big( f(x) - \inf_{y \in \mathbb{R}^d} f(y) \Big), \quad \forall x \in \mathbb{R}^d. \tag{2}$$
We refer to (2) as the $\mu$-PL condition and simply denote $\inf_{y \in \mathbb{R}^d} f(y)$ by $f^*$. The PL condition was introduced independently by Polyak (Polyak, 1963) and Łojasiewicz (Lojasiewicz, 1963). The PL condition is strictly weaker than strong convexity: any $\mu$-strongly convex function, which by definition satisfies
$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2,$$
is also $\mu$-PL, as one can see by minimizing both sides with respect to $x$ (Karimi et al., 2016). However, the PL condition does not even imply convexity. From a geometric viewpoint, the PL condition states that the squared norm of the gradient dominates the optimal function value gap, which implies that any stationary point is a global minimizer. Because it is relatively easy for first-order algorithms to find an approximate stationary point, the PL condition serves as an ideal, weaker alternative to strong convexity. In machine learning, the PL condition has recently received wide attention, and many models have been shown to satisfy it under different regimes. Examples include, but are not limited to, matrix decomposition and linear neural networks under a specific initialization (Hardt & Ma, 2016; Li et al., 2018), nonlinear neural networks in the so-called neural tangent kernel regime (Liu et al., 2022), and reinforcement learning with the linear quadratic regulator (Fazel et al., 2018). Compared with strong convexity, the PL condition is much easier to satisfy, since its reference point is only a minimum point $x^* = \operatorname{argmin}_y f(y)$, rather than an arbitrary $y$ in the domain.
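As a concrete illustration (not part of this paper's construction), the following Python sketch numerically checks the PL inequality for $f(x) = x^2 + 3\sin^2(x)$, a standard example of a non-convex PL function for which Karimi et al. (2016) report the PL constant $\mu = 1/32$; the grid range and tolerance are arbitrary choices of this sketch.

```python
import numpy as np

def f(x):
    # Non-convex but PL: the unique stationary point is the global minimum f(0) = 0.
    return x**2 + 3 * np.sin(x)**2

def grad_f(x):
    return 2 * x + 3 * np.sin(2 * x)

f_star = 0.0
mu = 1.0 / 32.0  # PL constant reported by Karimi et al. (2016)

# Check ||grad f(x)||^2 >= 2*mu*(f(x) - f*) on a dense grid of points.
xs = np.linspace(-10.0, 10.0, 100001)
slack = grad_f(xs)**2 - 2 * mu * (f(xs) - f_star)
print(slack.min() >= -1e-12)  # the PL inequality holds at every sampled point
```

At the same time, $f''(x) = 2 + 6\cos(2x)$ is negative wherever $\cos(2x) < -1/3$, so $f$ is genuinely non-convex while still satisfying (2).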
Turning to the theoretical side, it is known (Karimi et al., 2016) that the standard Gradient Descent algorithm converges linearly when minimizing an $L$-smooth and $\mu$-PL function. Specifically, in order to find an $\varepsilon$-approximate optimal solution $x$ such that $f(x) - f^* \le \varepsilon$, Gradient Descent needs $\mathcal{O}\big(\frac{L}{\mu}\log\frac{1}{\varepsilon}\big)$ gradient computations. However, it is still unclear whether there exist algorithms that achieve a provably faster convergence rate. In the optimization community, it is well known that the momentum technique, e.g. Nesterov (2003, chap. 2), provably accelerates Gradient Descent from $\mathcal{O}\big(\frac{L}{\mu}\log\frac{1}{\varepsilon}\big)$ to $\mathcal{O}\big(\sqrt{\frac{L}{\mu}}\log\frac{1}{\varepsilon}\big)$ for functions that are $L$-smooth and $\mu$-strongly convex. Although some works (J Reddi et al., 2016; Lei et al., 2017) have considered acceleration under different settings, a provably faster convergence of first-order algorithms for PL functions has not been obtained so far. In this paper, we study the first-order complexity of minimizing a generic smooth PL function and ask the question: "Is the Gradient Descent algorithm (nearly) optimal, or can we design a much faster algorithm?" We answer this question in the language of minimax lower bound complexity for the class of $L$-smooth and $\mu$-PL functions. We analyze the worst-case complexity of minimizing any function in this class using first-order algorithms. We construct a hard instance function showing that any first-order algorithm requires at least $\Omega\big(\frac{L}{\mu}\log\frac{1}{\varepsilon}\big)$ gradient costs to find an $\varepsilon$-approximate optimal solution. This answers the aforementioned question explicitly: the Gradient Descent algorithm is already optimal, in the sense that no first-order algorithm can achieve a provably faster convergence rate in general, up to a numerical constant.
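The linear rate behind the $\mathcal{O}\big(\frac{L}{\mu}\log\frac{1}{\varepsilon}\big)$ bound can be observed directly: one step of Gradient Descent with step size $1/L$ on an $L$-smooth $\mu$-PL function satisfies $f(x_{k+1}) - f^* \le (1 - \mu/L)\,(f(x_k) - f^*)$. The sketch below checks this per-step contraction on the example $f(x) = x^2 + 3\sin^2(x)$, assuming $L = 8$ (since $|f''| \le 8$) and $\mu = 1/32$ (Karimi et al., 2016); the starting point and iteration count are arbitrary.

```python
import numpy as np

def f(x):
    return x**2 + 3 * np.sin(x)**2

def grad_f(x):
    return 2 * x + 3 * np.sin(2 * x)

L, mu, f_star = 8.0, 1.0 / 32.0, 0.0
x = 3.0  # arbitrary starting point
gaps = [f(x) - f_star]
for _ in range(200):
    x = x - grad_f(x) / L          # standard GD step with step size 1/L
    gaps.append(f(x) - f_star)

# Classical one-step bound for L-smooth, mu-PL functions:
#   gap_{k+1} <= (1 - mu/L) * gap_k
rate = 1 - mu / L
assert all(g1 <= rate * g0 + 1e-15 for g0, g1 in zip(gaps, gaps[1:]))
print(f"final optimality gap after 200 steps: {gaps[-1]:.3e}")
```

The contraction factor $1 - \mu/L$ is exactly what yields the $\frac{L}{\mu}\log\frac{1}{\varepsilon}$ iteration count: halving the gap takes on the order of $L/\mu$ steps.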
For the first time, we distinguish the hardness of minimizing a PL function from that of a strongly convex function in terms of first-order complexities, since the momentum technique for smooth and strongly convex functions provably accelerates Gradient Descent by a polynomial order. It is worth mentioning that the optimization problem under our consideration is high-dimensional, and the goal is to obtain complexity bounds that have no explicit dependence on the dimension. Our technique for establishing the lower bound follows the previous lower bounds in convex (Nesterov, 2003) and non-convex optimization (Carmon et al., 2021). The main idea is to construct a so-called "zero-chain" function that forces any first-order algorithm to discover only one coordinate of the optimization variable per iteration. Then, for a "zero-chain" function of sufficiently high dimension, some entries will never reach their optimal values within any given number of iterations of any first-order algorithm. To obtain the desired $\Omega\big(\frac{L}{\mu}\log\frac{1}{\varepsilon}\big)$ lower bound, we propose a "zero-chain" function similar to that of Carmon et al. (2020), composed of the worst-case convex function designed by Nesterov (2003) and a separable function of the form $\sigma \sum_{i=1}^{T} v_{T,c}(x_i)$ or $\sum_{i=1}^{T} v_{y_i}(x_i)$ that destroys convexity. Different from their separable function, the one we introduce has a large Lipschitz constant. This property helps us estimate the PL constant in a convenient way. This new idea gives new insights into the construction and analysis of instance functions, which might potentially generalize to lower bounds for other non-convex problems.
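The zero-chain mechanism can be demonstrated on a toy instance (a simplified chain quadratic in the spirit of Nesterov's worst-case function, not the hard instance constructed in this paper): starting from the origin, the gradient is supported on the first coordinate only, and each subsequent gradient query can activate at most one new coordinate.

```python
import numpy as np

d = 50

def grad(x):
    # Gradient of the chain quadratic
    #   f(x) = 0.5*(x_1 - 1)^2 + 0.5 * sum_i (x_{i+1} - x_i)^2,
    # a simplified zero-chain: if supp(x) is contained in {1,...,k},
    # then supp(grad f(x)) is contained in {1,...,k+1}.
    g = np.zeros(d)
    g[0] = x[0] - 1.0
    diffs = x[1:] - x[:-1]
    g[:-1] -= diffs
    g[1:] += diffs
    return g

# Run gradient descent from the origin; a first-order method querying only
# gradients can reach at most one new coordinate per iteration.
x = np.zeros(d)
for k in range(1, 20):
    x = x - 0.25 * grad(x)
    active = np.flatnonzero(np.abs(x) > 0)
    assert active.size == 0 or active.max() <= k - 1  # 0-based: coords 1..k
print("nonzero coordinates after 19 steps:", np.count_nonzero(x))
```

With $d$ taken large relative to the iteration budget, most coordinates remain exactly zero, which is precisely what forces the lower bound: those entries are still far from their optimal values no matter which first-order algorithm generated the queries.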

NOTATION

We use bold letters, such as $x$, to denote vectors in the Euclidean space $\mathbb{R}^d$, and bold capital letters, such as $A$, to denote matrices. $I_d$ denotes the identity matrix of size $d \times d$; we omit the subscript and simply write $I$ when the dimension is clear from context. For $x \in \mathbb{R}^d$, we use $x_i$ to denote its $i$th coordinate. We use $\operatorname{supp}(x)$ to denote the set of subscripts of non-zero entries of $x$, i.e. $\operatorname{supp}(x) = \{ i : x_i \neq 0 \}$. We use $\operatorname{span}\{x^{(1)}, \cdots, x^{(n)}\}$ to denote the linear subspace spanned by $x^{(1)}, \cdots, x^{(n)}$, i.e. $\{ y : y = \sum_{i=1}^{n} a_i x^{(i)}, \, a_i \in \mathbb{R} \}$. We call a function $f$ $L$-smooth if $\nabla f$ is $L$-Lipschitz continuous, i.e. $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$. We denote $f^* = \inf_x f(x)$.

