ON THE LOWER BOUND OF MINIMIZING POLYAK-ŁOJASIEWICZ FUNCTIONS

Anonymous authors

Paper under double-blind review

Abstract

The Polyak-Łojasiewicz (PL) condition (Polyak, 1963) is weaker than strong convexity but suffices to ensure global convergence of the Gradient Descent algorithm. In this paper, we study lower bounds for algorithms that use first-order oracles to find an approximate optimal solution. We show that any first-order algorithm requires at least $\Omega\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ gradient costs to find an $\varepsilon$-approximate optimal solution for a general $L$-smooth function with PL constant $\mu$. This result demonstrates the optimality of Gradient Descent for minimizing smooth PL functions, in the sense that there exists a "hard" PL function on which no first-order algorithm can be faster than Gradient Descent by more than a numerical constant. In contrast, it is well known that the momentum technique, e.g. (Nesterov, 2003, chap. 2), provably accelerates Gradient Descent to $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\varepsilon}\right)$ gradient costs for functions that are $L$-smooth and $\mu$-strongly convex. Therefore, our result separates the hardness of minimizing a smooth PL function from that of minimizing a smooth strongly convex function: the complexity of the former cannot be improved by any polynomial order in general.

1. INTRODUCTION

We consider the problem
$$\min_{\mathbf{x}\in\mathbb{R}^d} f(\mathbf{x}), \tag{1}$$
where the function $f$ is $L$-smooth and satisfies the Polyak-Łojasiewicz condition. A function $f$ is said to satisfy the Polyak-Łojasiewicz condition if (2) holds for some $\mu > 0$:
$$\|\nabla f(\mathbf{x})\|^2 \ge 2\mu\left(f(\mathbf{x}) - \inf_{\mathbf{y}\in\mathbb{R}^d} f(\mathbf{y})\right), \quad \forall\, \mathbf{x}\in\mathbb{R}^d. \tag{2}$$
We refer to (2) as the $\mu$-PL condition and denote $\inf_{\mathbf{y}\in\mathbb{R}^d} f(\mathbf{y})$ by $f^*$. The PL condition was originally introduced by Polyak (Polyak, 1963) and Łojasiewicz (Lojasiewicz, 1963) independently. The PL condition is strictly weaker than strong convexity: any $\mu$-strongly convex function, which by definition satisfies
$$f(\mathbf{x}) \ge f(\mathbf{y}) + \langle\nabla f(\mathbf{y}), \mathbf{x}-\mathbf{y}\rangle + \frac{\mu}{2}\|\mathbf{x}-\mathbf{y}\|^2,$$
is also $\mu$-PL, as one can show by minimizing both sides with respect to $\mathbf{x}$ (Karimi et al., 2016). However, the PL condition does not even imply convexity. From a geometric view, the PL condition requires that the squared norm of the gradient dominates the optimality gap, which implies that every stationary point is a global minimizer. Because it is relatively easy to obtain an approximate stationary point with first-order algorithms, the PL condition serves as an ideal and weaker alternative to strong convexity.

In machine learning, the PL condition has received wide attention recently. Many models have been found to satisfy this condition in different regimes. Examples include, but are not limited to, matrix decomposition and linear neural networks under a specific initialization (Hardt & Ma, 2016; Li et al., 2018), nonlinear neural networks in the so-called neural tangent kernel regime (Liu et al., 2022), and reinforcement learning with the linear quadratic regulator (Fazel et al., 2018). Compared with strong convexity, the PL condition holds much more easily, since its reference point is only a minimizer $\mathbf{x}^* = \operatorname{argmin}_{\mathbf{y}} f(\mathbf{y})$ rather than an arbitrary $\mathbf{y}$ in the domain.
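For completeness, the step from strong convexity to the PL condition mentioned above can be written out as follows. This is a short sketch of the standard argument; it assumes only that $f$ is differentiable and $\mu$-strongly convex, and uses that the quadratic lower bound is minimized at $\mathbf{x} = \mathbf{y} - \frac{1}{\mu}\nabla f(\mathbf{y})$.

```latex
\begin{align*}
f^* = \inf_{\mathbf{x}} f(\mathbf{x})
  &\ge \min_{\mathbf{x}} \Big\{ f(\mathbf{y})
     + \langle \nabla f(\mathbf{y}), \mathbf{x}-\mathbf{y} \rangle
     + \frac{\mu}{2}\|\mathbf{x}-\mathbf{y}\|^2 \Big\} \\
  &= f(\mathbf{y}) - \frac{1}{2\mu}\|\nabla f(\mathbf{y})\|^2 ,
\end{align*}
% Rearranging gives the mu-PL inequality at every point y:
\[
\|\nabla f(\mathbf{y})\|^2 \ge 2\mu \big( f(\mathbf{y}) - f^* \big).
\]
```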
On the theoretical side, it is known (Karimi et al., 2016) that the standard Gradient Descent algorithm converges linearly when minimizing an $L$-smooth and $\mu$-PL function. Specifically, in order to find an $\varepsilon$-approximate optimal solution $\mathbf{x}$ such that $f(\mathbf{x}) - f^* \le \varepsilon$, Gradient Descent needs $O\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ gradient computations. However, it is still not clear whether there exist algorithms that achieve a provably faster convergence rate. In the optimization community, it is well known that the momentum technique, e.g. Nesterov (2003, chap. 2), provably accelerates Gradient Descent from $O\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ to $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\varepsilon}\right)$ for functions that are $L$-smooth and $\mu$-strongly convex. Even though some works (J Reddi et al., 2016; Lei et al., 2017) have considered accelerations under different settings, a provably faster convergence rate of first-order algorithms for PL functions has not been obtained so far.

In this paper, we study the first-order complexity of minimizing a generic smooth PL function and ask the question: "Is the Gradient Descent algorithm (nearly) optimal, or can we design a much faster algorithm?" We answer the question in the language of min-max lower bound complexity for the class of $L$-smooth and $\mu$-PL functions. We analyze the worst-case complexity of minimizing any function in this class using first-order algorithms. We construct a hard instance function showing that any first-order algorithm requires at least $\Omega\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ gradient costs to find an $\varepsilon$-approximate optimal solution. This answers the aforementioned question explicitly: the Gradient Descent algorithm is already optimal, in the sense that no first-order algorithm can achieve a provably faster convergence rate in general, up to a numerical constant.
For the first time, we distinguish the hardness of minimizing a PL function from that of minimizing a strongly convex function in terms of first-order complexity, as the momentum technique for smooth and strongly convex functions provably accelerates Gradient Descent by a polynomial order. It is worth mentioning that the optimization problem under our consideration is high-dimensional, and the goal is to obtain complexity bounds that have no explicit dependence on the dimension.

Our technique for establishing the lower bound follows previous lower bounds in convex (Nesterov, 2003) and non-convex optimization (Carmon et al., 2021). The main idea is to construct a so-called "zero-chain" function, which ensures that any first-order algorithm can discover only one new coordinate of the optimization variable per iteration. Then, for a "zero-chain" function of sufficiently high dimension, some entries will never reach their optimal values after any first-order algorithm has run for a given number of iterations. To obtain the desired $\Omega\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ lower bound, we propose a "zero-chain" function similar to that of Carmon et al. (2020), which is composed of the worst convex function designed by Nesterov (2003) and a separable function of the form $\sigma\sum_{i=1}^{T} v_{T,c}(x_i)$ or $\sum_{i=1}^{Tt} v_{y_i}(x_i)$ that destroys the convexity. Unlike their separable function, the one that we introduce has a large Lipschitz constant. This property helps us to estimate the PL constant in a convenient way. This idea gives new insights into the construction and analysis of hard instance functions, and might be generalized to establish lower bounds for other non-convex problems.

NOTATION

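We use bold letters, such as $\mathbf{x}$, to denote vectors in the Euclidean space $\mathbb{R}^d$, and bold capital letters, such as $\mathbf{A}$, to denote matrices. $\mathbf{I}_d$ denotes the identity matrix of size $d \times d$. We omit the subscript and simply write $\mathbf{I}$ for the identity matrix when the dimension is clear from context. For $\mathbf{x} \in \mathbb{R}^d$, we use $x_i$ to denote its $i$th coordinate. We use $\mathrm{supp}(\mathbf{x})$ to denote the set of indices of the non-zero entries of $\mathbf{x}$, i.e. $\mathrm{supp}(\mathbf{x}) = \{i : x_i \neq 0\}$. We use $\mathrm{span}\{\mathbf{x}^{(1)}, \cdots, \mathbf{x}^{(n)}\}$ to denote the linear subspace spanned by $\mathbf{x}^{(1)}, \cdots, \mathbf{x}^{(n)}$, i.e. $\left\{\mathbf{y} : \mathbf{y} = \sum_{i=1}^{n} a_i \mathbf{x}^{(i)},\ a_i \in \mathbb{R}\right\}$. We call a function $f$ $L$-smooth if $\nabla f$ is $L$-Lipschitz continuous, i.e. $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \le L\|\mathbf{x} - \mathbf{y}\|$ for all $\mathbf{x}, \mathbf{y}$. We denote $f^* = \inf_{\mathbf{x}} f(\mathbf{x})$. We let $\mathbf{x}^*$ be any minimizer of $f$, i.e., $\mathbf{x}^* \in \operatorname{argmin} f$. We always assume the existence of $\mathbf{x}^*$. We say that $\mathbf{x}$ is an $\varepsilon$-approximate optimal point of $f$ when $f(\mathbf{x}) - f^* \le \varepsilon$.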

2. RELATED WORK

Lower Bounds

There has been a line of research concerning the lower bounds of algorithms on certain function classes. To the best of our knowledge, Nemirovskij & Yudin (1983) define the oracle model to measure the complexity of algorithms, and most existing research on lower bounds follows this formulation of complexity. For convex functions and first-order oracles, the lower bound is studied in Nesterov (2003), where the well-known optimal lower bounds $\Omega(\varepsilon^{-1/2})$ and $\Omega\left(\sqrt{\kappa}\log\frac{1}{\varepsilon}\right)$ are obtained. For convex functions and $n$th-order oracles, lower bounds of $\Omega\left(\varepsilon^{-\frac{2}{3n+1}}\right)$ have been proposed in Arjevani et al. (2019b). When the function is non-convex, it is generally NP-hard to find its global minima, or to test whether a point is a local minimum or a saddle point (Murty & Kabadi, 1985). Instead of finding $\varepsilon$-approximate optimal points, an alternative measure is finding $\varepsilon$-stationary points, where $\|\nabla f(\mathbf{x})\| \le \varepsilon$. Sometimes, additional constraints on the Hessian matrices at second-order stationary points are needed. Results of this kind include Carmon et al. (2020; 2021); Fang et al. (2018); Zhou & Gu (2019); Arjevani et al. (2019a; 2020). Though a PL function may be non-convex, it is tractable to find an $\varepsilon$-approximate optimal point, as local minima of a PL function must be global minima. In this paper, we give the lower complexity bound for finding $\varepsilon$-approximate optimal points.

PL Condition

The PL condition was introduced by Polyak (Polyak, 1963) and Łojasiewicz (Lojasiewicz, 1963) independently. Besides the PL condition, there are other relaxations of strong convexity, including error bounds (Luo & Tseng, 1993), essential strong convexity (Liu et al., 2014), weak strong convexity (Necoara et al., 2019), restricted secant inequality (Zhang & Yin, 2013), and quadratic growth (Anitescu, 2000). Karimi et al. (2016) discussed the relationships between these conditions. All of these relaxations imply the PL condition except for quadratic growth, which shows that the PL condition is quite general. Many other papers study the design of practical algorithms for optimizing a PL objective function under different scenarios, for example, Bassily et al. (2018); Nouiehed et al. (2019); Hardt & Ma (2016); Fazel et al. (2018); J Reddi et al. (2016); Lei et al. (2017).

3.1. UPPER BOUND ON PL FUNCTIONS

Although the PL condition is weaker than strong convexity, it guarantees linear convergence of Gradient Descent. The result can be found in Polyak (1963) and Karimi et al. (2016). We present it here for completeness.

Theorem 1. If $f$ is $L$-smooth and satisfies the $\mu$-PL condition, then the Gradient Descent algorithm with constant step size $\frac{1}{L}$,
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \frac{1}{L}\nabla f(\mathbf{x}^{(k)}),$$
has a linear convergence rate:
$$f(\mathbf{x}^{(k)}) - f^* \le \left(1 - \frac{\mu}{L}\right)^{k}\left(f(\mathbf{x}^{(0)}) - f^*\right).$$

Theorem 1 shows that the Gradient Descent algorithm finds an $\varepsilon$-approximate optimal point of $f$ in $O\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ gradient computations. This gives an upper complexity bound for first-order algorithms. However, it remains open whether there are faster algorithms for smooth PL functions. We will establish a lower complexity bound for first-order algorithms that matches the upper bound up to a numerical constant.
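The rate in Theorem 1 can be checked numerically on a small example. The sketch below is an illustration only and is not part of the paper's construction: it assumes the one-dimensional non-convex test function $f(x) = x^2 + 3\sin^2 x$, for which $f^* = 0$ and $|f''(x)| = |2 + 6\cos 2x| \le 8$, so $L = 8$. The PL constant $\mu = 1/32$ is an assumed value for this example, and the helper `pl_holds_on_grid` verifies it on a finite grid rather than proving it. All function names are hypothetical.

```python
import math

L_SMOOTH = 8.0   # |f''(x)| = |2 + 6 cos(2x)| <= 8
MU = 1.0 / 32.0  # assumed PL constant for this example (checked on a grid below)


def f(x):
    """Non-convex test function with minimum value f* = 0 at x = 0."""
    return x * x + 3.0 * math.sin(x) ** 2


def grad(x):
    """Derivative of f: 2x + 3 sin(2x)."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def pl_holds_on_grid(lo=-10.0, hi=10.0, n=20001):
    """Check grad(x)^2 >= 2 * MU * (f(x) - f*) at n equally spaced points."""
    step = (hi - lo) / (n - 1)
    points = (lo + i * step for i in range(n))
    return all(grad(x) ** 2 >= 2.0 * MU * f(x) - 1e-12 for x in points)


def gradient_descent(x0, num_steps):
    """Gradient Descent with the constant step size 1/L from Theorem 1."""
    xs = [x0]
    for _ in range(num_steps):
        xs.append(xs[-1] - grad(xs[-1]) / L_SMOOTH)
    return xs


def first_hit(xs, eps):
    """Smallest iteration index t with f(x_t) - f* <= eps, or None."""
    for t, x in enumerate(xs):
        if f(x) <= eps:
            return t
    return None


if __name__ == "__main__":
    xs = gradient_descent(3.0, 200)
    rate = 1.0 - MU / L_SMOOTH
    bound_ok = all(f(x) <= rate ** k * f(xs[0]) + 1e-12 for k, x in enumerate(xs))
    print("PL inequality holds on grid:", pl_holds_on_grid())
    print("Theorem 1 bound holds along the run:", bound_ok)
    print("first iteration with gap <= 1e-6:", first_hit(xs, 1e-6))
```

On a benign instance like this one, Gradient Descent is typically much faster than the worst-case rate; the theorem only guarantees the upper bound.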

3.2. DEFINITIONS OF ALGORITHM CLASSES AND FUNCTION CLASSES

An algorithm is a mapping from real-valued functions to sequences. For an algorithm $A$ and $f : \mathbb{R}^d \to \mathbb{R}$, we define $A[f] = \{\mathbf{x}^{(i)}\}_{i\in\mathbb{N}}$ to be the sequence generated by $A$ acting on $f$, where $\mathbf{x}^{(i)} \in \mathbb{R}^d$. Note that the algorithms under our consideration work on functions defined on any Euclidean space. We call this the dimension-free property of the algorithm. This definition abstracts away the details of the optimization process. We consider algorithms that only make use of first-order information along the iteration sequence. We call them first-order algorithms. If $A$ is a first-order algorithm, then
$$\mathbf{x}^{(i)} = A^{(i)}\left(\mathbf{x}^{(0)}, \nabla f(\mathbf{x}^{(0)}), \cdots, \mathbf{x}^{(i-1)}, \nabla f(\mathbf{x}^{(i-1)})\right),$$
where $A^{(i)}$ is a function depending on $A$. Perhaps the simplest example of a first-order algorithm is Gradient Descent.

We are interested in finding an $\varepsilon$-approximate optimal point of a function $f$. Given a function $f : \mathbb{R}^d \to \mathbb{R}$ and an algorithm $A$, the complexity of $A$ on $f$ is the number of queries to the first-order oracle needed to find an $\varepsilon$-approximate optimal point. We denote by $T_\varepsilon(A, f)$ the gradient complexity of $A$ on $f$:
$$T_\varepsilon(A, f) = \min\left\{t : f(\mathbf{x}^{(t)}) - f^* \le \varepsilon\right\}.$$
In practice, we do not have full information about the function $f$. We only know that $f$ belongs to a particular function class $\mathcal{F}$, such as the $L$-smooth functions. Given an algorithm $A$, we denote by $T_\varepsilon(A, \mathcal{F})$ the complexity of $A$ on $\mathcal{F}$, defined as
$$T_\varepsilon(A, \mathcal{F}) = \sup_{f\in\mathcal{F}} T_\varepsilon(A, f).$$
Thus, $T_\varepsilon(A, \mathcal{F})$ is the worst-case complexity of $A$ over functions $f \in \mathcal{F}$. To find an $\varepsilon$-approximate optimal point of a function in $\mathcal{F}$, we need an algorithm that has low complexity on $\mathcal{F}$. Let $\mathcal{A}$ denote an algorithm class. The lower bound of an algorithm class on $\mathcal{F}$ describes the efficiency of the algorithm class $\mathcal{A}$ on the function class $\mathcal{F}$, and is defined as
$$T_\varepsilon(\mathcal{A}, \mathcal{F}) = \inf_{A\in\mathcal{A}} T_\varepsilon(A, \mathcal{F}) = \inf_{A\in\mathcal{A}}\sup_{f\in\mathcal{F}} T_\varepsilon(A, f).$$

3.3. ZERO-RESPECTING ALGORITHM

Among all algorithms, a special class is that of zero-respecting algorithms. If $A$ is a zero-respecting algorithm and $A[f] = \{\mathbf{x}^{(t)}\}_{t\in\mathbb{N}}$, then the following condition holds for all $f : \mathbb{R}^d \to \mathbb{R}$ and all $n$:
$$\mathrm{supp}\left(\mathbf{x}^{(n)} - \mathbf{x}^{(0)}\right) \subseteq \bigcup_{i=0}^{n-1} \mathrm{supp}\left(\nabla f(\mathbf{x}^{(i)})\right).$$
In particular, the condition holds whenever $\mathbf{x}^{(n)} - \mathbf{x}^{(0)}$ lies in the linear subspace spanned by $\nabla f(\mathbf{x}^{(0)}), \cdots, \nabla f(\mathbf{x}^{(n-1)})$. We denote the collection of first-order zero-respecting algorithms with $\mathbf{x}^{(0)} = \mathbf{0}$ by $\mathcal{A}_{zr}$. It is shown by Nemirovskij & Yudin (1983) that a lower complexity bound on first-order zero-respecting algorithms is also a lower complexity bound on all first-order algorithms when the function class satisfies the orthogonal invariance property.

3.4. ZERO-CHAIN

A zero-chain $f$ is a function that satisfies the following condition:
$$\mathrm{supp}(\mathbf{x}) \subseteq \{1, 2, \cdots, k\} \implies \mathrm{supp}(\nabla f(\mathbf{x})) \subseteq \{1, 2, \cdots, k+1\}, \quad \forall\, \mathbf{x}.$$
In other words, $\nabla f(\mathbf{x})$ lies in a restricted coordinate subspace determined by the support of $\mathbf{x}$. The "worst function in the (convex) world" in Nesterov (2003), defined as
$$f_k(\mathbf{x}) = \frac{1}{2}(x_1 - 1)^2 + \sum_{i=1}^{k-1}(x_{i+1} - x_i)^2, \tag{11}$$
is a zero-chain, because if $x_i = 0$ for $i > n$, then $(\nabla f_k(\mathbf{x}))_{i+1} = 0$ for $i > n$. A zero-chain is difficult to optimize for zero-respecting algorithms, because such algorithms discover only one new coordinate per gradient computation.
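The zero-chain property can be observed directly by running Gradient Descent from the origin on the chain function in (11). The following is a minimal sketch for illustration: the dimension, the step size `lr`, and all function names are hypothetical choices, and the support pattern does not depend on the constants in front of the quadratic terms.

```python
def grad_chain(x):
    """Gradient of f_k(x) = 1/2 (x_1 - 1)^2 + sum_{i=1}^{k-1} (x_{i+1} - x_i)^2."""
    k = len(x)
    g = [0.0] * k
    g[0] = x[0] - 1.0
    for i in range(k - 1):
        d = 2.0 * (x[i + 1] - x[i])
        g[i + 1] += d   # derivative of (x_{i+1} - x_i)^2 w.r.t. x_{i+1}
        g[i] -= d       # derivative of (x_{i+1} - x_i)^2 w.r.t. x_i
    return g


def support(x):
    """Set of 1-based indices of the non-zero entries of x."""
    return {i + 1 for i, xi in enumerate(x) if xi != 0.0}


def run_gd(k, steps, lr=0.1):
    """Run Gradient Descent from the origin and record the support of each iterate."""
    x = [0.0] * k
    supports = [support(x)]
    for _ in range(steps):
        g = grad_chain(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
        supports.append(support(x))
    return supports


if __name__ == "__main__":
    for n, s in enumerate(run_gd(k=10, steps=6)):
        # After n gradient computations, only the first n coordinates can be non-zero.
        print(n, sorted(s))
```

Since Gradient Descent started at the origin is zero-respecting, the iterate after $n$ gradient computations is supported on the first $n$ coordinates, so the remaining coordinates of a high-dimensional chain are still at their initial value.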

4. MAIN RESULTS

According to Theorem 1, we already have an upper complexity bound of $O\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$, obtained by applying Gradient Descent to any PL function. In this section, we establish the lower complexity bound of first-order algorithms on PL functions. Let $\mathcal{P}(\Delta, \mu, L)$ be the collection of all $L$-smooth and $\mu$-PL functions $f$ with $f(\mathbf{x}^{(0)}) - f^* \le \Delta$. We establish a lower bound on $T_\varepsilon(\mathcal{A}_{zr}, \mathcal{P}(\Delta, \mu, L))$ by constructing a function that is hard to optimize for zero-respecting algorithms, and then extend the result to all first-order algorithms. We first propose a relatively simple hard instance that leads to an $\Omega(\kappa^{1-a})$ lower bound in Section 4.1, where $a$ can be any real number in $(0, 1)$. This helps convey our intuition. Then we present a more complicated hard instance that achieves the desired $\Omega\left(\kappa\log\frac{1}{\varepsilon}\right)$ lower bound in Section 4.2.

4.1. $\Omega(\kappa^{1-a})$ LOWER BOUND

We introduce a hard function $f_{T,c,\sigma} : \mathbb{R}^T \to \mathbb{R}$ for first-order algorithms:
$$f_{T,c,\sigma}(\mathbf{x}) = q(\mathbf{x}) + \sigma\sum_{i=1}^{T} v_{T,c}(x_i),$$
where
$$q(\mathbf{x}) = \frac{1}{2}x_1^2 + \frac{1}{2}\sum_{i=1}^{T-1}(x_{i+1} - x_i)^2 \tag{13}$$
is a quadratic function, and we define $v_{T,c}$ as follows:
$$v_{T,c}(x) = \begin{cases}
\frac{1}{2}x^2, & x \le 1 - \frac{1}{32}T^{-c}, \\
\frac{1}{2}x^2 - 16T^{c}\left(x - 1 + \frac{1}{32}T^{-c}\right)^2, & 1 - \frac{1}{32}T^{-c} < x \le 1, \\
\frac{1}{2}x^2 - \frac{1}{32}T^{-c} + 16T^{c}\left(x - 1 - \frac{1}{32}T^{-c}\right)^2, & 1 < x \le 1 + \frac{1}{32}T^{-c}, \\
\frac{1}{2}x^2 - \frac{1}{32}T^{-c}, & x > 1 + \frac{1}{32}T^{-c},
\end{cases} \tag{14}$$
where $T$ is a positive integer, and $c, \sigma$ are positive real numbers. The function $q$ can be rewritten as $q(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top\mathbf{A}\mathbf{x}$, where
$$\mathbf{A} = \begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
 & & & -1 & 1
\end{pmatrix} \tag{16}$$
is a positive-definite symmetric matrix. From the definition of $v_{T,c}$ in (14), we obtain the derivative of $v_{T,c}$:
$$v'_{T,c}(x) = \begin{cases}
x, & x \le 1 - \frac{1}{32}T^{-c}, \\
x - 32T^{c}\left(x - 1 + \frac{1}{32}T^{-c}\right), & 1 - \frac{1}{32}T^{-c} < x \le 1, \\
x + 32T^{c}\left(x - 1 - \frac{1}{32}T^{-c}\right), & 1 < x \le 1 + \frac{1}{32}T^{-c}, \\
x, & x > 1 + \frac{1}{32}T^{-c}.
\end{cases} \tag{17}$$
As we can see in (17), $v'_{T,c}$ is a piecewise linear function. To further simplify notation, we define $b_{T,c}(x)$ in (18):
$$b_{T,c}(x) = \begin{cases}
1 - 32T^{c}|x - 1|, & 1 - \frac{1}{32}T^{-c} \le x \le 1 + \frac{1}{32}T^{-c}, \\
0, & \text{otherwise}.
\end{cases} \tag{18}$$
Then $v'_{T,c}(x) = x - b_{T,c}(x)$.

Figure 1 provides a geometric view of $v_{T,c}$ and $b_{T,c}$. The quadratic part $q$ is a translation of "the worst function in the (convex) world" in Nesterov (2003), and the definition of $v_{T,c}$ is inspired by the hard instance in Carmon et al. (2021). Our hard instance differs from previous ones mainly in the large Lipschitz constant of its gradient. We note that the controlled degree of nonsmoothness is crucial for our estimate of the PL constant. In Lemma 1 we list some important properties of $f_{T,c,\sigma}$, which we prove in Section ??.

Lemma 1. If $\sigma < 1$ and $T^{c} \ge \frac{1}{2}\sigma^{-1}$, then $f_{T,c,\sigma}$ satisfies the following.
1. $g_{T,c,\sigma}(\mathbf{x}) = f_{T,c,\sigma}(\mathbf{1} - \mathbf{x})$ is a zero-chain.
2. $\mathbf{x}^* = \mathbf{0}$, $f^*_{T,c,\sigma} = 0$, $f_{T,c,\sigma}(\mathbf{x}) \le \frac{1}{2}\mathbf{x}^\top(\mathbf{A} + \sigma\mathbf{I})\mathbf{x}$.
3. $f_{T,c,\sigma}$ is $(4 + \sigma + 32\sigma T^{c})$-smooth.
4. $f_{T,c,\sigma}$ satisfies the $\frac{1}{C_1 T^{1+5c}}$-PL condition, where $C_1$ is a universal constant.

We first study the lower bound for zero-respecting algorithms. Let $\bar{f}$ denote the following function:
$$\bar{f}(\mathbf{x}) = \frac{L T^{-1} D^2}{42\sigma T^{c}}\, f_{T,c,\sigma}\left(\mathbf{1} - T^{1/2}D^{-1}\mathbf{x}\right),$$
where $D$ is the distance between $\mathbf{0}$ and $\mathbf{x}^*$, and $T$, $\sigma$ are parameters to be specified later. A change in $D$ affects $\bar{f}(\mathbf{0}) - \bar{f}(\mathbf{x}^*)$ and $\|\mathbf{x}^* - \mathbf{0}\|$, but does not affect the condition number of $\bar{f}$. In Lemma 2, we show that $\bar{f}$ is difficult to optimize for all first-order zero-respecting algorithms.

Lemma 2. If $\sigma < 1$ and $T^{c} > \frac{1}{2}\sigma^{-1}$, a first-order zero-respecting algorithm with $\mathbf{x}^{(0)} = \mathbf{0}$ needs at least $\frac{T}{2}$ gradient computations to find a point $\mathbf{x}$ satisfying $\bar{f}(\mathbf{x}) - \bar{f}^* \le \frac{1}{16}\left(\bar{f}(\mathbf{x}^{(0)}) - \bar{f}^*\right)$.

We give the full proof of Lemma 2 in Appendix B.1. The key point is that each gradient access of a zero-respecting algorithm reveals only one new coordinate of a zero-chain. For $i > \frac{T}{2}$, $x^{(k)}_i$ remains unchanged when $k \le \frac{T}{2}$, which gives a lower bound on the function value after the first $\frac{T}{2}$ gradient computations.
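As a sanity check on the scalar building blocks of the hard instance in Section 4.1, the sketch below implements $v_{T,c}$, $v'_{T,c}$ and $b_{T,c}$ and compares $v'_{T,c}(x) = x - b_{T,c}(x)$ against a finite-difference derivative of $v_{T,c}$. It assumes that the quadratic pieces of $v_{T,c}$ are centered at $1 \mp \frac{1}{32}T^{-c}$, which is what the derivative in (17) requires. The values $T = 100$, $c = 0.5$ and all function names are illustrative choices, not the parameters used in the paper.

```python
def delta(T, c):
    """Half-width of the interval around 1 on which v_{T,c} is perturbed."""
    return T ** (-c) / 32.0


def b(x, T, c):
    """Tent function b_{T,c}: height 1 at x = 1, zero outside [1 - delta, 1 + delta]."""
    if abs(x - 1.0) <= delta(T, c):
        return 1.0 - 32.0 * T ** c * abs(x - 1.0)
    return 0.0


def v_prime(x, T, c):
    """Derivative of v_{T,c}, written as x - b_{T,c}(x)."""
    return x - b(x, T, c)


def v(x, T, c):
    """Piecewise quadratic v_{T,c}; pieces are centered at 1 -/+ delta (assumption)."""
    d = delta(T, c)
    s = 16.0 * T ** c
    if x <= 1.0 - d:
        return 0.5 * x * x
    if x <= 1.0:
        return 0.5 * x * x - s * (x - 1.0 + d) ** 2
    if x <= 1.0 + d:
        return 0.5 * x * x - d + s * (x - 1.0 - d) ** 2
    return 0.5 * x * x - d


if __name__ == "__main__":
    T, c, h = 100, 0.5, 1e-6
    d = delta(T, c)
    for x in (0.5, 1.0 - d / 2, 1.0 + d / 2, 1.5):
        finite_diff = (v(x + h, T, c) - v(x - h, T, c)) / (2 * h)
        print(x, finite_diff, v_prime(x, T, c))
```

On the interval $(1 - \frac{1}{32}T^{-c}, 1)$ the slope of $v'_{T,c}$ is $1 - 32T^{c} < 0$, which is where the convexity of the instance is destroyed, while the steep slope $1 + 32T^{c}$ on the other side accounts for the $32\sigma T^{c}$ term in the smoothness constant of Lemma 1.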
With Lemma 1 and Lemma 2 in hand, we are ready to give our lower complexity bound for zero-respecting first-order algorithms.

Theorem 2. Given $L \ge \mu > 0$ and $c < 0.01$. When $\kappa = \frac{L}{\mu} > C_2$, where $C_2$ is a universal constant, we let
$$T \in \left[\left(\frac{100}{C_2}\kappa\right)^{\frac{1}{1+6c}}, \left(\frac{200}{C_2}\kappa\right)^{\frac{1}{1+5c}}\right] \cap \mathbb{Z} \tag{20}$$
and
$$\sigma = \frac{100}{C_2}\kappa T^{-(1+6c)}. \tag{21}$$
Then $\bar{f}$ is $L$-smooth and $\mu$-PL. Moreover, any first-order zero-respecting algorithm with $\mathbf{x}^{(0)} = \mathbf{0}$ needs at least $\Omega\left(\kappa^{\frac{1}{1+6c}}\right)$ gradient computations to find a point $\mathbf{x}$ satisfying $\bar{f}(\mathbf{x}) - \bar{f}^* \le \frac{1}{16}\left(\bar{f}(\mathbf{x}^{(0)}) - \bar{f}^*\right)$.

The proof of Theorem 2 is provided in Section ??. For $a$ satisfying $\frac{a}{6(1-a)} < 0.01$, let $c = \frac{a}{6(1-a)}$, so that $\frac{1}{1+6c} = 1 - a$. Then, by Theorem 2, any zero-respecting algorithm needs at least $\Omega(\kappa^{1-a})$ gradient computations to find an $\varepsilon$-approximate optimal point of the function $\bar{f}$.

Using the technique of Nemirovskij & Yudin (1983), for specific function classes such as PL functions, a lower complexity bound on first-order zero-respecting algorithms is also a lower complexity bound on all first-order algorithms. Denoting the set of all first-order algorithms by $\mathcal{A}^{(1)}$, we have the following lemma:

Lemma 3. $T_\varepsilon\left(\mathcal{A}^{(1)}, \mathcal{P}(\Delta, \mu, L)\right) \ge T_\varepsilon\left(\mathcal{A}_{zr}, \mathcal{P}(\Delta, \mu, L)\right)$.

Note that given $L$ and $\mu$, the value $\bar{f}(\mathbf{0})$ can be controlled by choosing an appropriate $D$. Lemmas 2 and 3 lead to the following theorem:

Theorem 3. For any $0 < a < 1$, when $\varepsilon \le \frac{1}{16}\Delta$, $T_\varepsilon\left(\mathcal{A}^{(1)}, \mathcal{P}(\Delta, \mu, L)\right) \ge \Omega(\kappa^{1-a})$.

Discussion about parameter setting

The function $\bar{f}$ has hyper-parameters $T$, $\sigma$, $c$. In our construction, to achieve a lower bound of $\Omega(\kappa^{1-a})$ as $a$ tends to zero, $c$ must also tend to zero. The parameters $T$ and $\sigma$ are chosen according to (20) and (21). When $c$ tends to zero, $T = \Theta(\kappa)$ and $\sigma$ tends to 1.

4.2. $\Omega\left(\kappa\log\frac{1}{\varepsilon}\right)$ LOWER BOUND

We show that the $\Omega(\kappa^{1-a})$ lower bound can be further improved to $\Omega\left(\kappa\log\frac{1}{\varepsilon}\right)$ with a new hard instance based on $f_{T,c,\sigma}$ and a similar technique for estimating the PL constant. Detailed proofs of the lemmas and theorems in this subsection are provided in Appendix C.
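We first introduce several components of the new hard instance. We define
$$v_y(x) = \begin{cases}
\frac{1}{2}x^2, & x \le \frac{31}{32}y, \\
\frac{1}{2}x^2 - 16\left(x - \frac{31}{32}y\right)^2, & \frac{31}{32}y < x \le y, \\
\frac{1}{2}x^2 - \frac{1}{32}y^2 + 16\left(x - \frac{33}{32}y\right)^2, & y < x \le \frac{33}{32}y, \\
\frac{1}{2}x^2 - \frac{1}{32}y^2, & x > \frac{33}{32}y,
\end{cases}$$
where $y > 0$ is a constant. By the definition of $v_y$, we have
$$v'_y(x) = \begin{cases}
x, & x \le \frac{31}{32}y, \\
x - 32\left(x - \frac{31}{32}y\right), & \frac{31}{32}y < x \le y, \\
x + 32\left(x - \frac{33}{32}y\right), & y < x \le \frac{33}{32}y, \\
x, & x > \frac{33}{32}y.
\end{cases} \tag{25}$$
For the non-convex part, we define
$$b_y(x) = \begin{cases}
y - 32|x - y|, & \frac{31}{32}y \le x \le \frac{33}{32}y, \\
0, & \text{otherwise}.
\end{cases} \tag{26}$$
Then we have $v'_y(x) = x - b_y(x)$. For the convex part, we define $q_{T,t}(\mathbf{x})$ as follows (for convenience of notation, we define $x_0 = 0$):
$$q_{T,t}(\mathbf{x}) = \frac{1}{2}\sum_{i=0}^{t-1}\left[\left(\frac{7}{8}x_{iT} - x_{iT+1}\right)^2 + \sum_{j=1}^{T-1}\left(x_{iT+j+1} - x_{iT+j}\right)^2\right],$$
where $\mathbf{x} \in \mathbb{R}^{Tt}$. Since $q_{T,t}$ is a quadratic function of $\mathbf{x}$, it can be written as
$$q_{T,t}(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top\mathbf{B}\mathbf{x},$$
where $\mathbf{B}$ is a positive semi-definite symmetric matrix. Like $\mathbf{A}$, $\mathbf{B}$ also satisfies $\mathbf{0} \preceq \mathbf{B} \preceq 4\mathbf{I}$, because the sum of the absolute values of the non-zero entries in each row of $\mathbf{B}$ is at most 4. Let $\mathbf{y} \in \mathbb{R}^{Tt}$ be a vector satisfying $y_{qT+b} = \left(\frac{7}{8}\right)^{q}$, where $q \in \mathbb{N}$, $b \in \{1, 2, \cdots, T\}$. We define the hard instance $g_{T,t} : \mathbb{R}^{Tt} \to \mathbb{R}$ as follows:
$$g_{T,t}(\mathbf{x}) = q_{T,t}(\mathbf{x}) + \sum_{i=1}^{Tt} v_{y_i}(x_i).$$
Now we list some properties of $g_{T,t}$ in Lemma 4.

Lemma 4. $g_{T,t}$ satisfies the following.
1. $g_{T,t}(\mathbf{y} - \mathbf{x})$ is a zero-chain.
2. $\mathbf{x}^* = \mathbf{0}$, $g^*_{T,t} = 0$, $g_{T,t}(\mathbf{x}) \le \frac{1}{2}\mathbf{x}^\top(\mathbf{B} + \mathbf{I})\mathbf{x}$.
3. $g_{T,t}$ is 37-smooth.
4. $g_{T,t}$ satisfies the $\frac{1}{C_3 T}$-PL condition, where $C_3$ is a universal constant.

Define $\bar{g}$ to be the following function, which is hard for first-order algorithms:
$$\bar{g}(\mathbf{x}) = \frac{L T^{-1} D^2}{37}\, g_{T,t}\left(\mathbf{y} - T^{1/2}D^{-1}\mathbf{x}\right).$$
Similar to Lemma 2, we show that $\bar{g}$ is hard for first-order zero-respecting algorithms:

Lemma 5. Assume that $\varepsilon < 0.01$ and let $t = 2\log_{8/7}\frac{3}{2\varepsilon}$. A first-order zero-respecting algorithm with $\mathbf{x}^{(0)} = \mathbf{0}$ needs at least $\frac{1}{2}Tt$ gradient computations to find a point $\mathbf{x}$ satisfying $\bar{g}(\mathbf{x}) - \bar{g}^* \le \varepsilon\left(\bar{g}(\mathbf{x}^{(0)}) - \bar{g}^*\right)$.

With Lemmas 4 and 5, we obtain a lower bound for zero-respecting algorithms:

Theorem 4. Given $L \ge \mu > 0$. When $\kappa = \frac{L}{\mu} > C_4$, where $C_4$ is a universal constant, there exist $T$ and $t$ such that $\bar{g}$ is $L$-smooth and $\mu$-PL. Moreover, any first-order zero-respecting algorithm with $\mathbf{x}^{(0)} = \mathbf{0}$ needs at least $\Omega\left(\kappa\log\frac{1}{\varepsilon}\right)$ gradient computations to find a point $\mathbf{x}$ satisfying $\bar{g}(\mathbf{x}) - \bar{g}^* \le \varepsilon\left(\bar{g}(\mathbf{x}^{(0)}) - \bar{g}^*\right)$.

Finally, we arrive at a lower bound for first-order algorithms using Lemma 3:

Theorem 5. For any $0 < a < 1$, when $\varepsilon \le \frac{1}{16}\Delta$, $T_\varepsilon\left(\mathcal{A}^{(1)}, \mathcal{P}(\Delta, \mu, L)\right) \ge \Omega\left(\kappa\log\frac{1}{\varepsilon}\right)$.

This bound matches the convergence rate of Gradient Descent up to a constant.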

5. NUMERICAL EXPERIMENTS

We conduct numerical experiments on our hard instance. We take $\kappa$ relatively large, which reduces the influence of the numerical constants. We set $\kappa = 4.8384 \times 10^{8}$, $c = 0.001$ in Figure 2a and $\kappa = 4.8384 \times 10^{10}$, $c = 0.001$ in Figure 2b, and then compute the corresponding $T$ and $\sigma$ according to (20) and (21). We use Gradient Descent, Nesterov's Accelerated Gradient Descent (AGD) and Polyak's Heavy-ball Method to optimize the hard instance. As AGD and the Heavy-ball Method are designed for convex functions while our hard instance is non-convex, we need to choose the parameter that plays the role of the strong convexity constant in both algorithms. The first choice is $\mu$, which is the PL constant of our hard instance. The second choice is $\sigma$, because when $x_i$ is far from 1, our hard instance can be treated as a $\sigma$-strongly convex quadratic function. From Figure 2, we observe that the convergence can be roughly divided into two phases. In phase one, the optimization methods try to pull every coordinate of $\mathbf{x}$ away from 1, and the function value does not decrease much. In phase two, they converge linearly. This may be due to the fact that when each $x_i$ is far from 1, our hard instance is exactly a strongly convex quadratic function. The observation is consistent with our theoretical results.

6. CONCLUSION

We establish a lower complexity bound for optimizing smooth PL functions with first-order methods. A first-order algorithm needs at least $\Omega\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ gradient accesses to find an $\varepsilon$-approximate optimal point of an $L$-smooth $\mu$-PL function. Our lower bound matches the convergence rate of Gradient Descent up to constants.



); Nouiehed et al. (2019); Hardt & Ma (2016); Fazel et al. (2018); J Reddi et al. (2016); Lei et al. (2017).

Figure 1: Illustration of $v_{T,c}$, $v'_{T,c}$ and $b_{T,c}$.


Figure 2: Convergence rate under Gradient Descent, Nesterov's Accelerated Gradient Descent and Polyak's Heavy-ball Method

acknowledgement

We only focus on deterministic algorithms in this paper. We conjecture that our results can be extended to randomized algorithms, using the same technique as in Nemirovskij & Yudin (1983) and the explicit constructions in Woodworth & Srebro (2016) and Woodworth & Srebro (2017). We leave the formal derivation to future work.

