LINEAR CONVERGENCE AND IMPLICIT REGULARIZATION OF GENERALIZED MIRROR DESCENT WITH TIME-DEPENDENT MIRRORS

Anonymous

Abstract

The following questions are fundamental to understanding the properties of overparameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum? (2) What form of implicit regularization occurs through training? While significant progress has been made in answering both of these questions for gradient descent, they have yet to be answered more completely for general optimization methods. In this work, we establish sufficient conditions for linear convergence and obtain approximate implicit regularization results for generalized mirror descent (GMD), a generalization of mirror descent with a possibly time-dependent mirror. GMD subsumes popular first-order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad. Using the Polyak-Lojasiewicz inequality, we first present a simple analysis under which non-stochastic GMD converges linearly to a global minimum. We then present a novel, Taylor-series-based analysis to establish sufficient conditions for linear convergence of stochastic GMD. As a corollary, our result establishes sufficient conditions and provides learning rates for linear convergence of stochastic mirror descent and Adagrad. Lastly, we obtain approximate implicit regularization results for GMD by proving that GMD converges to an interpolating solution that is approximately the closest interpolating solution to the initialization in ℓ2 norm in the dual space.



1. INTRODUCTION

Recently, Azizan & Hassibi (2019); Azizan et al. (2019) simultaneously proved convergence and analyzed approximate implicit regularization for mirror descent (Beck & Teboulle, 2003; Nemirovsky & Yudin, 1983). In particular, by using the fundamental identity of stochastic mirror descent (SMD), they proved that SMD converges to an interpolating solution that is approximately the closest one to the initialization in Bregman divergence. However, these works do not provide a rate of convergence for SMD, and they assume that there exists an interpolating solution within ε in Bregman divergence from the initialization. In this work, we provide sufficient conditions for linear convergence and obtain approximate implicit regularization results for generalized mirror descent (GMD), an extension of mirror descent that introduces (1) a potential-free update rule and (2) a time-dependent mirror. Namely, GMD with invertible φ^(t) : R^d → R^d and learning rate η is used to minimize a real-valued loss function f according to the update rule:

φ^(t)(w^(t+1)) = φ^(t)(w^(t)) − η ∇f(w^(t)).    (1)

We discuss the stochastic version of GMD (SGMD) in Section 3. GMD generalizes both mirror descent and preconditioning methods. Namely, if φ^(t) = ∇ψ for all t for some strictly convex function ψ, then GMD corresponds to mirror descent with potential ψ; if φ^(t)(x) = G^(t) x for some invertible matrix G^(t) ∈ R^{d×d}, then the update rule in equation (1) reduces to

w^(t+1) = w^(t) − η (G^(t))^{-1} ∇f(w^(t)),

and hence represents applying a preconditioner to gradient updates. The following is a summary of our results:

1. We provide a simple proof of linear convergence for GMD under the Polyak-Lojasiewicz inequality (Theorem 1).
2. We provide sufficient conditions under which SGMD converges linearly under an adaptive learning rate (Theorems 2 and 3).
3. As corollaries to Theorems 1 and 3, in Section 5 we provide sufficient conditions for linear convergence of stochastic mirror descent as well as stochastic preconditioner methods such as Adagrad (Duchi et al., 2011).
4. We prove the existence of an interpolating solution and linear convergence of GMD to this solution for non-negative loss functions that locally satisfy the PL* inequality (Liu et al., 2020). This result (Theorem 4) provides approximate implicit regularization results for GMD: GMD converges linearly to an interpolating solution that is approximately the closest interpolating solution to the initialization in ℓ2 norm in the dual space induced by φ^(t).
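To make the update rule concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper) of a single GMD step, instantiated for the three special cases above. The entropy-like potential used for the mirror-descent case is an assumed example, not one used in the paper.

```python
import numpy as np

def gmd_step(w, g, phi, phi_inv, eta):
    """One GMD step: phi(w_next) = phi(w) - eta * g, i.e. w_next = phi_inv(phi(w) - eta * g)."""
    return phi_inv(phi(w) - eta * g)

w = np.array([1.0, 2.0])        # current iterate
g = np.array([0.5, -0.5])       # gradient of the loss at w
eta = 0.1

# Special case 1: gradient descent, phi = identity.
identity = lambda x: x
gd = gmd_step(w, g, identity, identity, eta)          # w - eta * g

# Special case 2: mirror descent with the (illustrative) potential
# psi(x) = sum_i (x_i log x_i - x_i) on the positive orthant,
# so phi = grad psi = log and phi_inv = exp (exponentiated-gradient update).
md = gmd_step(w, g, np.log, np.exp, eta)              # w * exp(-eta * g)

# Special case 3: preconditioned gradient descent, phi(x) = G x with G invertible.
G = np.diag([2.0, 4.0])
pgd = gmd_step(w, g, lambda x: G @ x, lambda x: np.linalg.solve(G, x), eta)
```

Note that the mirror-descent case recovers the familiar multiplicative update, while the linear case recovers w − η G^{-1} g, matching the preconditioned rule in equation (1).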

2. RELATED WORK

Recent work (Azizan et al., 2019) established convergence of stochastic mirror descent (SMD) for nonlinear optimization problems. It characterized the implicit bias of mirror descent by demonstrating that SMD converges to a global minimum that is within epsilon of the closest interpolating solution in Bregman divergence. The analysis in Azizan et al. (2019) relies on the fundamental identity of SMD and does not provide explicit learning rates or establish a rate of convergence for SMD in the nonlinear setting. The work in Azizan & Hassibi (2019) provided explicit learning rates for the convergence of SMD in the linear setting under a strongly convex potential, again without a rate of convergence. While these works established convergence of SMD, prior work by Gunasekar et al. (2018) analyzed the implicit bias of SMD without proving convergence. A potential-based version of generalized mirror descent with time-varying regularizers was presented for online problems in Orabona et al. (2015). That work is primarily concerned with establishing regret bounds for the online learning setting, which differs from our setting of minimizing a loss function given a set of known data points. A potential-free formulation of GMD in the continuous-time (flow) setting was presented in Gunasekar et al. (2020). The Polyak-Lojasiewicz (PL) inequality (Lojasiewicz, 1963; Polyak, 1963) serves as a simple condition for linear convergence in non-convex optimization problems and is satisfied in a number of settings including over-parameterized neural networks (Liu et al., 2020). Work by Karimi et al. (2016) demonstrated linear convergence of a number of descent methods (including gradient descent) under the PL inequality. Similarly, Vaswani et al. (2019) proved linear convergence of stochastic gradient descent (SGD) under the PL inequality and the strong growth condition (SGC), and Bassily et al. (2018) established the same rate for SGD under just the PL inequality. Soltanolkotabi et al. (2019) also used the PL inequality to establish a local linear convergence result for gradient descent on one-hidden-layer over-parameterized neural networks. Instead of focusing on a specific method, the goal of this work is to establish sufficient conditions for linear convergence by applying the PL inequality to a more general setting (SGMD). We arrive at linear convergence for specific methods such as mirror descent and preconditioned gradient descent methods as corollaries. Moreover, our local convergence results provide an intuitive formulation of approximate implicit regularization for GMD and thus mirror descent. Namely, instead of resorting to Bregman divergence, we prove that GMD converges to an interpolating solution that is approximately the closest interpolating solution to the initialization in ℓ2 norm in the dual space induced by φ^(t).

3. ALGORITHM DESCRIPTION AND PRELIMINARIES

We begin with a formal description of SGMD. Let f_i : R^d → R denote real-valued, differentiable loss functions and let f(x) = (1/n) Σ_{i=1}^n f_i(x). In addition, let φ^(t) : R^d → R^d be an invertible function for all non-negative integers t. We solve the optimization problem arg min_{x∈R^d} f(x) using stochastic generalized mirror descent with learning rate η:

φ^(t)(w^(t+1)) = φ^(t)(w^(t)) − η ∇f_{i_t}(w^(t)),    (2)

where i_t ∈ [n] is chosen uniformly at random. As described in the introduction, the above algorithm generalizes both gradient descent (where φ^(t)(x) = x) and mirror descent (where φ^(t)(x) = ∇ψ(x) for some strictly convex potential function ψ). In the case where φ^(t)(x) = G^(t) x for an invertible matrix G^(t) ∈ R^{d×d}, the update rule in equation (2) reduces to:

w^(t+1) = w^(t) − η (G^(t))^{-1} ∇f_{i_t}(w^(t)).

Hence, when φ^(t) is an invertible linear transformation, equation (2) is equivalent to preconditioned gradient descent. We now present the Polyak-Lojasiewicz inequality and lemmas from optimization theory that will be used in our proofs.

Polyak-Lojasiewicz (PL) Inequality. A function f : R^d → R is µ-PL if for some µ > 0:

(1/2) ‖∇f(x)‖² ≥ µ (f(x) − f(x*))  for all x ∈ R^d,

where x* ∈ R^d is a global minimizer of f. A useful variation of the PL inequality is the PL* inequality introduced in Liu et al. (2020), which does not require knowledge of f(x*).

Definition. A function f : R^d → R is µ-PL* if for some µ > 0:

(1/2) ‖∇f(x)‖² ≥ µ f(x)  for all x ∈ R^d.

A function that is µ-PL* is also µ-PL when f is non-negative. Additionally, we will typically assume that f is L-smooth (i.e., has an L-Lipschitz continuous gradient).

Definition. A function f : R^d → R is L-smooth for L > 0 if for all x, y ∈ R^d: ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖.

If φ^(t)(x) = x for all t and x ∈ R^d, then SGMD reduces to SGD. If f is L-smooth and satisfies the PL inequality, then SGD converges linearly to a global minimum (Bassily et al., 2018; Karimi et al., 2016; Vaswani et al., 2019).
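As a concrete check of the PL* condition, the following sketch (our own illustration, not code from the paper) verifies numerically that the squared loss of a noiseless over-parameterized linear model satisfies the µ-PL* inequality with µ equal to the smallest non-zero eigenvalue of XXᵀ, as used later in Section 7:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                      # over-parameterized: more parameters than samples
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)     # noiseless labels: an interpolating solution exists

def loss(w):                       # squared loss f(w) = (1/2) ||Xw - y||^2
    r = X @ w - y
    return 0.5 * r @ r

def grad(w):
    return X.T @ (X @ w - y)

eigs = np.linalg.eigvalsh(X @ X.T)
mu = eigs[eigs > 1e-10].min()      # smallest non-zero eigenvalue of X X^T

# PL*: (1/2)||grad f(w)||^2 >= mu f(w) at every w, because the residual
# Xw - y lies in the column space of X, where X X^T has eigenvalues >= mu.
ok = all(0.5 * grad(w) @ grad(w) >= mu * loss(w) * (1 - 1e-9)
         for w in rng.standard_normal((100, d)))
```

Here `ok` comes out true for every sampled point, reflecting that the inequality holds identically (not just statistically) for consistent linear systems.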
Moreover, the following lemma (proven in Appendix A) shows that the PL* condition implies the existence of a global minimum x* for non-negative, L-smooth f.

Lemma 1. If f : R^d → R is µ-PL*, L-smooth, and f(x) ≥ 0 for all x ∈ R^d, then gradient descent with learning rate η < 2/L converges linearly to x* satisfying f(x*) = 0.

Hence, in cases where the loss function is non-negative (for example, the squared loss), we can remove the usual assumption about the existence of a global minimum x* and instead assume that f satisfies the PL* inequality. We now reference standard properties of L-smooth functions (Zhou, 2018), which will be used in our proofs.

Lemma 2. If f : R^d → R is L-smooth, then for all x, y ∈ R^d:
(a) f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ‖y − x‖²,
(b) ‖∇f(x)‖² ≤ 2L (f(x) − f(x*)).

The following lemma relates µ and L (the proof is in Appendix B).

Lemma 3. If f : R^d → R is µ-PL and L-smooth, then µ ≤ L.

Using Lemma 2b in place of the strong growth condition (i.e., E_i[‖∇f_i(x)‖²] ≤ ρ ‖∇f(x)‖²) yields slightly different learning rates when establishing convergence of stochastic descent methods (as is apparent from the different learning rates between Bassily et al. (2018) and Vaswani et al. (2019)). The following simple lemma will be used in the proof of Theorem 3.

Lemma 4. If f(x) = (1/n) Σ_{i=1}^n f_i(x) where the f_i : R^d → R are L_i-smooth, then f is sup_i L_i-smooth.

Note that there could exist some other constant L' < sup_i L_i for which f is L'-smooth, but this upper bound suffices for our proof of Theorem 3. Lastly, we define and reference standard properties of strongly convex functions (Zhou, 2018), which will be useful in demonstrating how our GMD results generalize those for mirror descent.

Definition. For α > 0, a differentiable function ψ : R^d → R is α-strongly convex if for all x, y:
ψ(y) ≥ ψ(x) + ⟨∇ψ(x), y − x⟩ + (α/2) ‖y − x‖².

Lemma 5. If ψ : R^d → R is α-strongly convex, then for all x, y:
ψ(y) ≤ ψ(x) + ⟨∇ψ(x), y − x⟩ + (1/(2α)) ‖∇ψ(y) − ∇ψ(x)‖².

With these preliminaries in hand, we now present our proofs for linear convergence of SGMD using the PL inequality.

4. SUFFICIENT CONDITIONS FOR LINEAR CONVERGENCE OF SGMD

In this section, we provide sufficient conditions to establish (expected) linear convergence for (stochastic) GMD. We first provide simple conditions under which GMD converges linearly by extending the proof strategy from Karimi et al. (2016) . We then present alternate conditions for linear convergence of GMD, which can be naturally extended to the stochastic setting.

4.1. SIMPLE CONDITIONS FOR LINEAR CONVERGENCE OF GMD

We begin with a simple set of conditions under which (non-stochastic) GMD converges linearly (the full proof is presented in Appendix C). The main benefit of this analysis is that it is a straightforward extension of the proof of linear convergence for gradient descent under the PL inequality presented in Karimi et al. (2016).

Theorem 1. Suppose f : R^d → R is L-smooth and µ-PL, and φ^(t) : R^d → R^d is an invertible, α_u^(t)-Lipschitz function with lim_{t→∞} α_u^(t) < ∞. If for all x, y ∈ R^d and for all timesteps t there exists α_l^(t) > 0 such that

⟨φ^(t)(x) − φ^(t)(y), x − y⟩ ≥ α_l^(t) ‖x − y‖²,

and lim_{t→∞} α_l^(t) > 0, then generalized mirror descent converges linearly to a global minimum for any η^(t) < 2α_l^(t)/L.

Remark. Theorem 1 yields a fixed learning rate provided that α_l^(t) is uniformly bounded. In addition, note that Theorem 1 also applies under weaker assumptions, namely when φ^(t) is locally Lipschitz. Finally, the provided learning rate can be computed exactly in settings such as linear regression, since it only requires knowledge of L and α_l^(t) (see Section 7). When η = α_l^(t)/L and w* is a minimizer of f, the proof of Theorem 1 implies that:

f(w^(t+1)) − f(w*) ≤ (1 − µ (α_l^(t))² / (L (α_u^(t))²)) (f(w^(t)) − f(w*)).

Letting κ^(t) = L (α_u^(t))² / (µ (α_l^(t))²), if κ^(t) is decreasing in t, the rate is given by:

f(w^(t+1)) − f(w*) ≤ (1 − 1/κ^(0))^{t+1} (f(w^(0)) − f(w*)).
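As an illustration of Theorem 1 (our own sketch on a toy problem, not an experiment from the paper), the following runs GMD with a fixed diagonal mirror φ(x) = Gx on noiseless least squares and checks the per-step contraction factor 1 − µα_l²/(Lα_u²) implied by the proof when η = α_l/L:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)            # noiseless, so f(w*) = 0

eigs = np.linalg.eigvalsh(X.T @ X)
L, mu = eigs.max(), eigs.min()            # f(w) = (1/2)||Xw - y||^2 is L-smooth and mu-PL

G = np.diag(rng.uniform(1.0, 2.0, d))     # fixed mirror phi(x) = G x (diagonal, positive)
alpha_l, alpha_u = G.diagonal().min(), G.diagonal().max()
# <phi(x) - phi(y), x - y> >= alpha_l ||x - y||^2, and phi is alpha_u-Lipschitz.

eta = alpha_l / L                         # learning rate from the remark after Theorem 1
rate = 1 - mu * alpha_l**2 / (L * alpha_u**2)

f = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
w = np.zeros(d)
losses = [f(w)]
for _ in range(200):
    w = w - eta * np.linalg.solve(G, X.T @ (X @ w - y))  # GMD: w+ = w - eta G^{-1} grad f
    losses.append(f(w))
```

Tracking `losses` step by step, each ratio `losses[t+1] / losses[t]` stays below the theoretical factor `rate`, matching the linear convergence guarantee.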

4.2. TAYLOR SERIES ANALYSIS FOR LINEAR CONVERGENCE IN GMD

Although the proof of Theorem 1 is succinct, it is nontrivial to extend to the stochastic setting. In order to develop a convergence result for the stochastic setting, we turn to an alternate set of conditions for linear convergence by using the Taylor expansion of φ^{-1}. We use J_φ to denote the Jacobian of φ. For ease of notation, we consider non-time-dependent α_l, α_u, but our results extend directly to the setting where these quantities are time-dependent.

Theorem 2. Suppose f : R^d → R is L-smooth and µ-PL and φ : R^d → R^d is an infinitely differentiable, analytic function with analytic inverse φ^{-1}. If there exist α_l, α_u > 0 such that:
(a) α_l I ⪯ J_φ ⪯ α_u I,
(b) |∂_{i_1,…,i_k} φ^{-1}_j(x)| ≤ k!/(2 α_u d) for all x ∈ R^d, i_1, …, i_k ∈ [d], j ∈ [d], k ≥ 2,
then generalized mirror descent converges linearly for any η^(t) < min( 4α_l²/(5Lα_u), 1/(2√d ‖∇f(w^(t))‖) ).

The full proof is provided in Appendix D. Importantly, the adaptive component of the learning rate is only used to ensure that the sum of the higher-order terms of the Taylor expansion converges. In particular, if φ is a linear function, then our learning rate no longer needs to be adaptive. Note that, alternatively, we can establish linear convergence for a fixed learning rate given that the gradients monotonically decrease or if f is non-negative and µ-PL*. We analyze this case in Appendix E and provide an explicit condition on µ and L under which this holds.

4.3. TAYLOR SERIES ANALYSIS FOR LINEAR CONVERGENCE IN STOCHASTIC GMD

The main benefit of the above Taylor series analysis is that it naturally extends to the stochastic setting, as demonstrated in the following result (with proof presented in Appendix F).

Theorem 3. Suppose f(x) = (1/n) Σ_{i=1}^n f_i(x), where the f_i : R^d → R are non-negative, L_i-smooth functions with L = sup_{i∈[n]} L_i, and f is µ-PL*. Let φ : R^d → R^d be an infinitely differentiable, analytic function with analytic inverse φ^{-1}. SGMD is used to minimize f according to the updates:

φ(w^(t+1)) = φ(w^(t)) − η^(t) ∇f_{i_t}(w^(t)),

where i_t ∈ [n] is chosen uniformly at random and η^(t) is an adaptive step size. If there exist α_l, α_u > 0 such that:
(a) α_l I ⪯ J_φ ⪯ α_u I,
(b) |∂_{i_1,…,i_k} φ^{-1}_j(x)| ≤ k! µ/(2 α_u d L) for all x ∈ R^d, i_1, …, i_k ∈ [d], j ∈ [d], k ≥ 2,
then SGMD with η^(t) < min( 4µα_l²/(5L²α_u), 1/(2√d max_i ‖∇f_i(w^(t))‖) ) converges linearly to a global minimum.

Remark. Note that there is a slight difference between the learning rates in Theorem 2 and Theorem 3 due to a multiplicative factor of µ. Consistent with the difference in learning rates between Bassily et al. (2018) and Vaswani et al. (2019), we can make the learning rates of the two theorems match if we assume the strong growth condition (i.e., E_i[‖∇f_i(x)‖²] ≤ ρ ‖∇f(x)‖²) with ρ = µ instead of using Lemma 2b. Moreover, as max_i ‖∇f_i(w^(t))‖ ≤ √(2nL f(w^(0))), we also establish linear convergence for the fixed step size η < min( 4µα_l²/(5L²α_u), 1/(2√(2dnL f(w^(0)))) ).
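The bound max_i ‖∇f_i(w^(t))‖ ≤ √(2nL f(w^(0))) used in the remark can be checked directly. The sketch below (our own toy setup, not the paper's code) computes the gradient-dependent components of the adaptive and fixed step-size bounds of Theorem 3 for stochastic least squares with φ = identity; the common term 4µα_l²/(5L²α_u) of the min is the same in both and is omitted here:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 60
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)            # noiseless labels

# f_i(w) = (1/2)(x_i^T w - y_i)^2 is L_i-smooth with L_i = ||x_i||^2.
L = np.sum(X**2, axis=1).max()            # L = sup_i L_i (Lemma 4)

w0 = np.zeros(d)
f0 = 0.5 * np.mean((X @ w0 - y) ** 2)     # f(w^(0)) with f = (1/n) sum_i f_i

grads = (X @ w0 - y)[:, None] * X         # row i is grad f_i(w^(0))
max_grad = np.linalg.norm(grads, axis=1).max()

# Gradient-dependent part of the adaptive bound in Theorem 3 (phi = identity):
eta_adaptive = 1 / (2 * np.sqrt(d) * max_grad)
# Gradient-dependent part of the fixed bound from the remark, valid for all t:
eta_fixed = 1 / (2 * np.sqrt(2 * d * n * L * f0))
```

Since ‖∇f_i(w)‖² ≤ 2L f_i(w) ≤ 2nL f(w), the fixed bound is always the more conservative of the two.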

5. COROLLARIES OF LINEAR CONVERGENCE IN SGMD

We now present how the linear convergence results established by Theorems 1, 2, and 3 apply to commonly used optimization algorithms including mirror descent and Adagrad. In this section, we primarily extend the analysis of Theorem 1 for the non-stochastic case. However, our results can be extended analogously to give expected linear convergence in the stochastic case by using the extension provided in Theorem 3.

Gradient Descent. For the case of gradient descent, φ(x) = x and so α_l = α_u = 1. Hence, we see that gradient descent converges linearly under the conditions of Theorem 1 with η < 2/L, which is consistent with the analysis in Karimi et al. (2016).

Mirror Descent. Let ψ : R^d → R be a strictly convex potential, so that φ(x) = ∇ψ(x) is an invertible function. If ψ is α_l-strongly convex, ∇ψ is (locally) α_u-Lipschitz, and f is L-smooth and µ-PL, then the conditions of Theorem 1 are satisfied. Moreover, since the α_u-Lipschitz condition holds locally for most potentials considered in practice, our result implies linear convergence for mirror descent with an α_l-strongly convex potential ψ.

Adagrad. Let φ^(t) = (G^(t))^{1/2}, where G^(t) is a diagonal matrix with G^(t)_{i,i} = Σ_{k=0}^t (∇f(w^(k))_i)². Then GMD corresponds to Adagrad. In this case, we can apply Theorem 1 to establish linear convergence of Adagrad under the PL inequality, provided that φ^(t) satisfies the condition of Theorem 1. The following corollary proves that this condition holds and hence that Adagrad converges linearly.

Corollary 1. Let f : R^d → R be an L-smooth function that is µ-PL. Let (α_l^(t))² = min_{i∈[d]} G^(t)_{i,i} and (α_u^(t))² = max_{i∈[d]} G^(t)_{i,i}. If lim_{t→∞} α_l^(t)/α_u^(t) ≠ 0, then Adagrad converges linearly for the adaptive step size η^(t) = α_l^(t)/L.

The proof is presented in Appendix H.
While Corollary 1 can be extended to the stochastic setting via Theorem 3, doing so requires knowledge of µ to set up the learning rate, and the resulting learning rate is typically smaller than what can be used in practice. We analyze this case further in Section 7. Additionally, since the condition lim_{t→∞} α_l^(t)/α_u^(t) ≠ 0 is difficult to verify in practice, we provide Corollary 2 in Appendix H, which presents a verifiable condition under which Adagrad converges linearly.

6. LOCAL CONVERGENCE AND IMPLICIT REGULARIZATION IN GMD

In the previous sections, we established linear convergence of GMD for a real-valued loss f : R^d → R that is µ-PL for all x ∈ R^d. In this section, we show that f need only satisfy the PL inequality locally (i.e., within a ball of fixed radius around the initialization) in order to establish linear convergence. The following theorem (proof in Appendix G) extends Theorem 4.2 of Liu et al. (2020) to GMD and uses the PL* condition to establish both the existence of a global minimum and linear convergence to this global minimum under GMD. We use B(x, R) = {z ∈ R^d : ‖x − z‖_2 ≤ R} to denote the ball of radius R centered at x.

Theorem 4. Suppose φ : R^d → R^d is an invertible, α_u-Lipschitz function and that f : R^d → R is non-negative, L-smooth, and µ-PL* on B = {x : φ(x) ∈ B(φ(w^(0)), R)} with R = 2√(2L f(w^(0))) α_u² / (α_l µ). If for all x, y ∈ R^d there exists α_l > 0 such that

⟨φ(x) − φ(y), x − y⟩ ≥ α_l ‖x − y‖²,

then,

(1) There exists a global minimum w^(∞) ∈ B.
(2) GMD converges linearly to w^(∞) for η = α_l/L.
(3) If w* = arg min_{w∈B : f(w)=0} ‖φ(w) − φ(w^(0))‖, then ‖φ(w*) − φ(w^(∞))‖ ≤ 2R.

Approximate Implicit Regularization in GMD. When R is small, we can view the result of Theorem 4 as a characterization of the solution selected by GMD, thereby obtaining approximate implicit regularization results for GMD. Namely, for δ = 2R, we have ‖φ(w*) − φ(w^(∞))‖ ≤ δ. Hence, provided that R is small (which holds for small f(w^(0))), GMD selects an interpolating solution that is close to w* in ℓ2 norm in the dual space induced by φ. This view is consistent with the characterization of approximate implicit regularization in Azizan et al. (2019), as shown by Corollary 3 in Appendix I. In particular, Corollary 3 implies the assumptions used in Azizan et al. (2019) for the full batch case by proving (1) the existence of such a w^(∞), (2) linear convergence from w^(0) to w^(∞), and (3) explicit forms for ε (where ε = 2R above). Importantly, the approximate implicit regularization result for mirror descent does not need to be stated in terms of Bregman divergence, but can be viewed more naturally as ‖∇ψ(w^(∞)) − ∇ψ(w*)‖_2 being small.
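For intuition, consider the special case φ = identity (gradient descent), where the dual space coincides with the primal space and the closest interpolating solution to the initialization has a closed form via the pseudoinverse. The sketch below (our own illustration, not code from the paper) checks that gradient descent on noiseless over-parameterized least squares converges to exactly this solution:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 50
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)                # noiseless over-parameterized system

w0 = rng.standard_normal(d)                   # arbitrary initialization
eta = 1 / np.linalg.eigvalsh(X @ X.T).max()   # eta < 2/L for f(w) = (1/2)||Xw - y||^2

w = w0.copy()
for _ in range(20000):
    w = w - eta * X.T @ (X @ w - y)           # gradient descent (GMD with phi = identity)

# Closest interpolating solution to w0 in l2 norm: project the residual
# correction onto the row space of X via the pseudoinverse.
w_star = w0 + np.linalg.pinv(X) @ (y - X @ w0)
```

Gradient descent only ever moves within w0 + rowspace(X), so its limit interpolates the data and coincides with w_star, the ℓ2-closest interpolating solution to the initialization.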

7. EXPERIMENTAL VERIFICATION OF OUR THEORETICAL RESULTS

We now present a simple set of experiments in which we can explicitly compute the learning rates from our theorems. We show that, in accordance with our theory, both fixed and adaptive versions of these learning rates yield linear convergence. We focus on computing learning rates for Adagrad in the noiseless regression setting used in Xie et al. (2020). Namely, we are given (X, y) ∈ R^{n×d} × R^n such that there exists a w* ∈ R^d with Xw* = y. If n < d, the system is over-parameterized; if n ≥ d, the system is sufficiently parameterized and has a unique solution. In this setting, the squared loss (MSE) is L-smooth with L = λ_max(XX^T) and µ-PL with µ = λ_min(XX^T), where λ_max and λ_min refer to the largest and smallest non-zero eigenvalues, respectively. Moreover, for Adagrad, we can compute (α_l^(t))² = min_{i∈[d]} Σ_{k=0}^t (∇f(w^(k))_i)² and (α_u^(t))² = max_{i∈[d]} Σ_{k=0}^t (∇f(w^(k))_i)² at each timestep. Hence, for Adagrad in the noiseless linear regression setting, we can explicitly compute the learning rate provided in Theorem 3 for the stochastic setting and in Corollary 1 for the full batch setting. Figure 1 demonstrates that in both the over-parameterized and sufficiently parameterized settings, our provided learning rates yield linear convergence. In the stochastic setting, the theory for fixed learning rates suggests a very small rate (≈ 10^{-9} for Figure 1d), and hence we chose to present only the more reasonable adaptive step size as a comparison. In the full batch setting, the learning rate obtained from our theorems outperforms the standard fixed learning rate of 0.1, while performance is comparable in the stochastic setting. Interestingly, our theory suggests an adaptive learning rate that is increasing (in contrast to the usual decreasing learning rate schedules). In particular, while the suggested learning rate for Figure 1a starts at 0.99, it increases to 1.56 by the end of training.
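As an illustrative sketch of this computation (our own code, not the paper's implementation), the following runs full-batch Adagrad on synthetic noiseless regression with the adaptive learning rate η^(t) = α_l^(t)/L from Corollary 1; per the theory, the loss should decrease monotonically:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 60, 50                    # sufficiently parameterized (n >= d)
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)   # noiseless: X w* = y has a solution

f = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
grad = lambda w: X.T @ (X @ w - y)

L = np.linalg.eigvalsh(X @ X.T).max()   # f is L-smooth with L = lambda_max(X X^T)

w = np.zeros(d)
G = np.zeros(d)                  # diagonal of Adagrad's G^(t)
losses = [f(w)]
for _ in range(1000):
    g = grad(w)
    G += g ** 2                  # G^(t)_{i,i} = sum_{k<=t} (grad f(w^(k))_i)^2
    alpha_l = np.sqrt(G.min())   # alpha_l^(t) as defined in Corollary 1
    eta = alpha_l / L            # adaptive learning rate eta^(t) = alpha_l^(t) / L
    w = w - eta * g / np.sqrt(G) # Adagrad step: w+ = w - eta (G^(t))^{-1/2} grad f(w)
    losses.append(f(w))
```

With η^(t) = α_l^(t)/L < 2α_l^(t)/L, each step satisfies the descent condition of Theorem 1, so `losses` is non-increasing; how quickly it shrinks depends on the ratio α_l^(t)/α_u^(t) for the sampled data.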
In Appendix J, we present experiments on over-parameterized neural networks. While the PL condition holds in this setting (Liu et al., 2020), it can be difficult to compute the smoothness parameter L (which was the motivation for developing Adagrad-Norm). Interestingly, our experiments demonstrate that our increasing adaptive learning rate from Theorem 1, using an approximation for L, provides convergence for Adagrad on over-parameterized networks. The link to the code is provided in Appendix J.

8. CONCLUSION

In this work, we presented stochastic generalized mirror descent, which generalizes both mirror descent and preconditioner methods. By using the PL condition and a Taylor-series-based analysis, we provided sufficient conditions for linear convergence of SGMD in the non-convex setting. As a corollary, we obtained sufficient conditions for linear convergence of both mirror descent and preconditioner methods such as Adagrad. Lastly, we proved the existence of an interpolating solution and linear convergence of GMD to this solution for non-negative loss functions that are locally PL*. Importantly, our local convergence results allow us to obtain approximate implicit regularization results for GMD. Namely, we prove that GMD converges linearly to an interpolating solution that is approximately the closest interpolating solution to the initialization in ℓ2 norm in the dual space. For the full batch setting, this result provides a more natural characterization of implicit regularization in terms of ℓ2 norm in the dual space, as opposed to Bregman divergence. Looking ahead, we envision that the generality of our analysis (and of the PL condition) could prove useful in the analysis of other commonly used adaptive methods such as Adam (Kingma & Ba, 2015). Moreover, since the PL condition holds in varied settings including over-parameterized neural networks (Liu et al., 2020), it would be interesting to analyze whether the learning rates obtained here provide an improvement for convergence in these modern settings.

APPENDIX A PROOF OF LEMMA 1

We restate the lemma below.

Lemma. If f : R^d → R is µ-PL*, L-smooth, and f(x) ≥ 0 for all x ∈ R^d, then gradient descent with learning rate η < 2/L converges linearly to x* satisfying f(x*) = 0.

Proof. The proof follows exactly from Theorem 1 of Karimi et al. (2016). Since f is L-smooth, by Lemma 2a it holds that:

f(w^(t+1)) − f(w^(t)) ≤ ⟨∇f(w^(t)), w^(t+1) − w^(t)⟩ + (L/2) ‖w^(t+1) − w^(t)‖²
⟹ f(w^(t+1)) − f(w^(t)) ≤ −η ‖∇f(w^(t))‖² + (L/2) η² ‖∇f(w^(t))‖²
⟹ f(w^(t+1)) − f(w^(t)) ≤ (−η + η²L/2) · 2µ f(w^(t))
⟹ f(w^(t+1)) ≤ (1 − 2µη + µη²L) f(w^(t)).

Hence, if η < 2/L, then C = 1 − 2µη + µη²L < 1. Thus, we have f(w^(t+1)) ≤ C f(w^(t)) with C < 1. Since f is bounded below by 0 and the sequence {f(w^(t))}_{t∈N} monotonically decreases with infimum 0, the monotone convergence theorem implies lim_{t→∞} f(w^(t)) = 0.

B PROOF OF LEMMA 3

Proof. From Lemma 2b and the PL condition, we have:

2µ (f(x) − f(x*)) ≤ ‖∇f(x)‖² ≤ 2L (f(x) − f(x*)) ⟹ µ ≤ L.

C PROOF OF THEOREM 1

Proof. Since f is L-smooth, by Lemma 2a it holds that:

f(w^(t+1)) − f(w^(t)) ≤ ⟨∇f(w^(t)), w^(t+1) − w^(t)⟩ + (L/2) ‖w^(t+1) − w^(t)‖².    (5)

Now, by the condition on φ^(t) in Theorem 1, we bound the first term on the right as follows:

⟨φ^(t)(w^(t+1)) − φ^(t)(w^(t)), w^(t+1) − w^(t)⟩ ≥ α_l^(t) ‖w^(t+1) − w^(t)‖²
⟹ ⟨−η ∇f(w^(t)), w^(t+1) − w^(t)⟩ ≥ α_l^(t) ‖w^(t+1) − w^(t)‖²   (using the GMD update rule)
⟹ ⟨∇f(w^(t)), w^(t+1) − w^(t)⟩ ≤ −(α_l^(t)/η) ‖w^(t+1) − w^(t)‖².

Substituting this bound back into inequality (5), we obtain

f(w^(t+1)) − f(w^(t)) ≤ (−α_l^(t)/η + L/2) ‖w^(t+1) − w^(t)‖².

Since the learning rate is selected so that the coefficient of ‖w^(t+1) − w^(t)‖² on the right is negative, we obtain

f(w^(t+1)) − f(w^(t)) ≤ (−α_l^(t)/η + L/2) ‖w^(t+1) − w^(t)‖²
≤ (−α_l^(t)/η + L/2) (1/(α_u^(t))²) ‖φ^(t)(w^(t+1)) − φ^(t)(w^(t))‖²   (since φ^(t) is α_u^(t)-Lipschitz)
= (−α_l^(t)/η + L/2) (1/(α_u^(t))²) ‖−η ∇f(w^(t))‖²   (using the GMD update rule)
≤ (−α_l^(t)/η + L/2) (η²/(α_u^(t))²) · 2µ (f(w^(t)) − f(w*))   (as f is µ-PL)
⟹ f(w^(t+1)) − f(w*) ≤ (1 − 2µη α_l^(t)/(α_u^(t))² + µLη²/(α_u^(t))²) (f(w^(t)) − f(w*)).    (6)

For linear convergence, we need

0 < 1 − 2µη α_l^(t)/(α_u^(t))² + µLη²/(α_u^(t))² < 1.

From Lemma 3, µ ≤ L (α_u^(t))²/(α_l^(t))² always holds, which implies that the left inequality in (6) is satisfied for all η^(t). The right inequality holds by our assumption that η^(t) < 2α_l^(t)/L, which completes the proof.

D PROOF OF THEOREM 2

We repeat the theorem below for convenience.

Theorem. Suppose f : R^d → R is L-smooth and µ-PL and φ : R^d → R^d is an infinitely differentiable, analytic function with analytic inverse φ^{-1}. If there exist α_l, α_u > 0 such that:
(a) α_l I ⪯ J_φ ⪯ α_u I,
(b) |∂_{i_1,…,i_k} φ^{-1}_j(x)| ≤ k!/(2 α_u d) for all x ∈ R^d, i_1, …, i_k ∈ [d], j ∈ [d], k ≥ 2,
then generalized mirror descent converges linearly for η^(t) < min( 4α_l²/(5Lα_u), 1/(2√d ‖∇f(w^(t))‖) ).

Proof. Since f is L-smooth, it holds by Lemma 2a that:

f(w^(t+1)) − f(w^(t)) ≤ ⟨∇f(w^(t)), w^(t+1) − w^(t)⟩ + (L/2) ‖w^(t+1) − w^(t)‖².

Next, we want to bound the two quantities on the right-hand side by a multiple of ‖∇f(w^(t))‖². We do so by expanding w^(t+1) − w^(t) using the Taylor series of φ^{-1}:

w^(t+1) − w^(t) = φ^{-1}(φ(w^(t)) − η∇f(w^(t))) − w^(t)
= −η J_{φ^{-1}}(φ(w^(t))) ∇f(w^(t)) + [ Σ_{k=2}^∞ (1/k!) Σ_{i_1,…,i_k=1}^d (−η)^k ∂_{i_1,…,i_k} φ^{-1}_j(φ(w^(t))) ∇f(w^(t))_{i_1} ⋯ ∇f(w^(t))_{i_k} ]_j,

where the bracketed quantity is a column vector for which we write out only the j-th coordinate, j ∈ [d]. Now we bound the term ⟨∇f(w^(t)), w^(t+1) − w^(t)⟩:

⟨∇f(w^(t)), w^(t+1) − w^(t)⟩ = −η ∇f(w^(t))^T J_φ^{-1}(w^(t)) ∇f(w^(t)) + ∇f(w^(t))^T [ Σ_{k=2}^∞ (1/k!) Σ_{i_1,…,i_k=1}^d (−η)^k ∂_{i_1,…,i_k} φ^{-1}_j(φ(w^(t))) ∇f(w^(t))_{i_1} ⋯ ∇f(w^(t))_{i_k} ]_j.

We have separated the first-order term from the other orders because we bound them separately using conditions (a) and (b), respectively. Namely, we first have:

−η ∇f(w^(t))^T J_φ^{-1}(w^(t)) ∇f(w^(t)) ≤ −(η/α_u) ‖∇f(w^(t))‖².

Next, we use the Cauchy-Schwarz inequality on inner products to bound the inner product of ∇f(w^(t)) with the higher-order terms. In the following, we use α to denote 1/(2 α_u d):

|∇f(w^(t))^T [ Σ_{k=2}^∞ … ]_j|
≤ ‖∇f(w^(t))‖ · ‖ [ Σ_{k=2}^∞ … ]_j ‖
≤ ‖∇f(w^(t))‖ · √d Σ_{k=2}^∞ α η^k (|∇f(w^(t))_1| + ⋯ + |∇f(w^(t))_d|)^k
= ‖∇f(w^(t))‖ α Σ_{k=2}^∞ √d η^k ‖∇f(w^(t))‖_1^k
≤ ‖∇f(w^(t))‖ α Σ_{k=2}^∞ √d η^k (√d ‖∇f(w^(t))‖)^k
= α Σ_{k=2}^∞ (√d)^{k+1} η^k ‖∇f(w^(t))‖^{k+1}
= α (√d)³ η² ‖∇f(w^(t))‖³ Σ_{k=0}^∞ (√d η ‖∇f(w^(t))‖)^k
= α (√d)³ η² ‖∇f(w^(t))‖³ / (1 − √d η ‖∇f(w^(t))‖).

Hence, selecting η < 1/(2√d ‖∇f(w^(t))‖) ensures √d η ‖∇f(w^(t))‖ ≤ 1/2 ≤ 1 − √d η ‖∇f(w^(t))‖, so that:

α (√d)³ η² ‖∇f(w^(t))‖³ / (1 − √d η ‖∇f(w^(t))‖) ≤ α (√d)³ η² ‖∇f(w^(t))‖³ / (√d η ‖∇f(w^(t))‖) = d α η ‖∇f(w^(t))‖².

Thus, we have established the following bound:

⟨∇f(w^(t)), w^(t+1) − w^(t)⟩ ≤ (−η/α_u + dαη) ‖∇f(w^(t))‖² = −(η/(2α_u)) ‖∇f(w^(t))‖².

Proceeding analogously as above, we establish a bound on ‖w^(t+1) − w^(t)‖²:

‖w^(t+1) − w^(t)‖² ≤ (η²/α_l² + α² d² η²) ‖∇f(w^(t))‖² = (η²/α_l² + η²/(4α_u²)) ‖∇f(w^(t))‖².

Putting the bounds together, we obtain:

f(w^(t+1)) − f(w^(t)) ≤ (−η/(2α_u) + Lη²/(2α_l²) + Lη²/(8α_u²)) ‖∇f(w^(t))‖².

We select our learning rate to make the coefficient of ‖∇f(w^(t))‖² negative, and thus by the PL inequality we have:

f(w^(t+1)) − f(w^(t)) ≤ (−η/(2α_u) + Lη²/(2α_l²) + Lη²/(8α_u²)) · 2µ (f(w^(t)) − f(w*))
⟹ f(w^(t+1)) − f(w*) ≤ (1 − µη/α_u + µLη²/α_l² + µLη²/(4α_u²)) (f(w^(t)) − f(w*)).

Now, in order for this system to have a solution, we must ensure that the discriminant of the quadratic equation in η arising from the right-hand-side inequality is larger than zero. In particular, we require:

µ²/α_u² − 4 (1 − µ/L) (µL/α_l² + µL/(4α_u²)) > 0 ⟹ µ/L > (4α_u² + α_l²)/(4α_u² + 2α_l²),

which completes the proof.

F PROOF OF THEOREM 3

We repeat the theorem below for convenience.

Theorem. Suppose f(x) = (1/n) Σ_{i=1}^n f_i(x), where the f_i : R^d → R are non-negative, L_i-smooth functions with L = sup_{i∈[n]} L_i, and f is µ-PL*. Let φ : R^d → R^d be an infinitely differentiable, analytic function with analytic inverse φ^{-1}. SGMD is used to minimize f according to the updates:

φ(w^(t+1)) = φ(w^(t)) − η^(t) ∇f_{i_t}(w^(t)),

where i_t ∈ [n] is chosen uniformly at random and η^(t) is an adaptive step size. If there exist α_l, α_u > 0 such that:
(a) α_l I ⪯ J_φ ⪯ α_u I,
(b) |∂_{i_1,…,i_k} φ^{-1}_j(x)| ≤ k! µ/(2 α_u d L) for all x ∈ R^d, i_1, …, i_k ∈ [d], j ∈ [d], k ≥ 2,
then SGMD converges linearly to a global minimum for any η^(t) < min( 4µα_l²/(5L²α_u), 1/(2√d max_i ‖∇f_i(w^(t))‖) ).

Proof. We follow the proof of Theorem 2. Namely, Lemma 4 implies that f is L-smooth, and hence

f(w^(t+1)) − f(w^(t)) ≤ ⟨∇f(w^(t)), w^(t+1) − w^(t)⟩ + (L/2) ‖w^(t+1) − w^(t)‖².

As before, we want to bound the two quantities on the right in terms of ‖∇f(w^(t))‖². Following the bounds from the proof of Theorem 2, provided η^(t) < 1/(2√d ‖∇f_{i_t}(w^(t))‖), we have

|∇f(w^(t))^T [ Σ_{k=2}^∞ (1/k!) Σ_{l_1,…,l_k=1}^d (−η^(t))^k ∂_{l_1,…,l_k} φ^{-1}_j(φ(w^(t))) ∇f_{i_t}(w^(t))_{l_1} ⋯ ∇f_{i_t}(w^(t))_{l_k} ]_j| ≤ η^(t) µ/(2α_u L) ‖∇f(w^(t))‖ ‖∇f_{i_t}(w^(t))‖.

To remove the dependence of η^(t) on i_t, we take η^(t) < 1/(2√d max_i ‖∇f_i(w^(t))‖). Since f is µ-PL* and f_i is non-negative for all i ∈ [n], ‖∇f_i(w^(t))‖² ≤ 2L f_i(w^(t)) ≤ 2Ln f(w^(t)). Thus, we can take

η^(t) < 1/(2√(2dLn f(w^(t)))) ≤ 1/(2√d max_i ‖∇f_i(w^(t))‖).

This implies the following bounds:

⟨∇f(w^(t)), w^(t+1) − w^(t)⟩ ≤ −η^(t) ∇f(w^(t))^T J_φ^{-1}(w^(t)) ∇f_{i_t}(w^(t)) + η^(t) µ/(2α_u L) ‖∇f(w^(t))‖ ‖∇f_{i_t}(w^(t))‖,
‖w^(t+1) − w^(t)‖² ≤ ((η^(t))²/α_l² + (η^(t))²/(4α_u²)) ‖∇f_{i_t}(w^(t))‖².

Taking expectations over i_t and applying the µ-PL* inequality then yields expected linear convergence, as in the proof of Theorem 2.

G PROOF OF THEOREM 4

We repeat the theorem below for convenience.

Theorem. Suppose φ : R^d → R^d is an invertible, α_u-Lipschitz function and that f : R^d → R is non-negative, L-smooth, and µ-PL* on B = {x : φ(x) ∈ B(φ(w^(0)), R)} with R = 2√(2L f(w^(0))) α_u² / (α_l µ).
If for all x, y ∈ R^d there exists α_l > 0 such that ⟨φ(x) − φ(y), x − y⟩ ≥ α_l ‖x − y‖^2, then:

(1) There exists a global minimum w^{(∞)} ∈ B.
(2) GMD converges linearly to w^{(∞)} for η = α_l/L.
(3) If w^* = argmin_{w ∈ B ; f(w)=0} ‖φ(w) − φ(w^{(0)})‖, then ‖φ(w^*) − φ(w^{(∞)})‖ ≤ 2R.

Proof. The proof follows from the proofs of Lemma 1, Theorem 1, and Theorem 4.2 of Liu et al. (2020). Namely, we proceed by strong induction. Let κ = Lα_u^2/(µα_l^2). At timestep 0, we trivially have that w^{(0)} ∈ B and f(w^{(0)}) ≤ f(w^{(0)}). At timestep t, we assume that w^{(0)}, w^{(1)}, …, w^{(t)} ∈ B and that f(w^{(i)}) ≤ (1 − κ^{−1}) f(w^{(i−1)}) for i ∈ [t]. Then at timestep t + 1, from the proofs of Lemma 1 and Theorem 1, we have:

f(w^{(t+1)}) ≤ (1 − κ^{−1}) f(w^{(t)}).

Next, we need to show that w^{(t+1)} ∈ B. From the proof of the contraction f(w^{(t+1)}) ≤ (1 − κ^{−1}) f(w^{(t)}), we also have the identity (7) relating ‖φ(w^{(t+1)}) − φ(w^{(t)})‖ to f(w^{(t)}) − f(w^{(t+1)}). Using the update rule and Lemma 2, we have:

‖φ(w^{(t+1)}) − φ(w^{(0)})‖ ≤ Σ_{i=0}^t ‖φ(w^{(i+1)}) − φ(w^{(i)})‖ = η Σ_{i=0}^t ‖∇f(w^{(i)})‖
  ≤ η √(2L) Σ_{i=0}^t √(f(w^{(i)})) ≤ η √(2L f(w^{(0)})) Σ_{i=0}^t (√(1 − κ^{−1}))^i
  ≤ η √(2L f(w^{(0)})) · 1/(1 − √(1 − κ^{−1})) ≤ η √(2L f(w^{(0)})) · 2κ   (using 1 − √(1 − κ^{−1}) ≥ κ^{−1}/2)
  = (α_l/L) √(2L f(w^{(0)})) · 2Lα_u^2/(µα_l^2) = 2√(2L) √(f(w^{(0)})) α_u^2/(α_l µ) = R.

Letting c^* = min(c − ε, min_{t∈{0,1,…,N}} κ^{(t)}) implies that (1 − κ^{(t)}) < 1 − c^* for all timesteps t. Thus, we have that:

Π_{i=0}^∞ (1 − κ^{(i)}) < Π_{i=0}^∞ (1 − c^*) = 0.

Thus, Adagrad converges linearly to a global minimum. We present Corollary 2 below.

Corollary 2. Let f : R^d → R be an L-smooth function that is µ-PL. Let α …

Proof. By definition of G^{(t)}, we have that: (1) (α_l^{(t)})^2 = min_{i∈[d]} G^{(t)}_{i,i}, and (2) (α_u^{(t)})^2 = max_{i∈[d]} G^{(t)}_{i,i}. In particular, we can choose α_l = α_l^{(0)} uniformly. We now need to ensure that α_u^{(t)} does not diverge. We prove this by using strong induction to show that α_u^{(t)} is bounded uniformly.

(Figure caption fragments:) …leads to convergence for Adagrad in the noisy linear regression setting (60 examples in 50 dimensions with uniform noise applied to the labels). (a) 1 hidden layer network with Leaky ReLU activation (Xu et al., 2015) and 100 hidden units.
(b) 1 hidden layer network with x + sin(x) activation and 100 hidden units. All networks were trained using a single Titan Xp, but can be trained on a laptop as well.
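To make the SGMD update rule of Theorem 3 concrete, here is a minimal runnable sketch on an over-parameterized least-squares problem. The mirror φ(w) = 2w (with inverse φ^{-1}(z) = z/2), the problem sizes, and the step size are all illustrative assumptions rather than the paper's experimental setup; with this toy mirror the update reduces to SGD with a rescaled step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-parameterized least squares: f_i(w) = 0.5 * (x_i @ w - y_i)^2, with an
# interpolating solution since dim > n. Sizes are illustrative assumptions.
n, dim = 20, 30
X = rng.normal(size=(n, dim))
y = X @ rng.normal(size=dim)

# Toy mirror (an assumption for this sketch, not the paper's choice):
phi = lambda w: 2.0 * w
phi_inv = lambda z: 0.5 * z

def sgmd(w, eta, steps):
    """SGMD: phi(w_{t+1}) = phi(w_t) - eta * grad f_{i_t}(w_t)."""
    for _ in range(steps):
        i = rng.integers(n)                    # i_t chosen uniformly at random
        grad = (X[i] @ w - y[i]) * X[i]        # grad f_i(w)
        w = phi_inv(phi(w) - eta * grad)
    return w

w = sgmd(np.zeros(dim), eta=0.04, steps=50_000)
loss = 0.5 * np.mean((X @ w - y) ** 2)         # near zero at an interpolator
```

Under interpolation, the loss at the returned iterate is driven toward numerical zero, consistent with a linear-convergence guarantee.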



We also provide a fixed learning rate for the case of monotonically decreasing gradient norms ‖∇f(w^{(t)})‖. The framework also allows for adaptive learning rates by using η^{(t)} to denote a time-dependent step size. We assume all norms are the 2-norm unless stated otherwise. The main difficulty is relating w^{(t+1)} − w^{(t)} to the gradient at timestep t. We require additional assumptions on φ^{(t)} for the case of time-dependent mirrors (see Appendix G). We take µ to be the smallest non-zero eigenvalue, since Adagrad updates keep the parameters in the span of the data.



the condition number introduced in Definition 4.1 of Liu et al. (2020) for gradient descent. Provided that κ = lim_{t→∞} κ^{(t)} > 0, Theorem 1 guarantees linear convergence to a global minimum. When


Figure 1: Using the rates provided by Corollary 1 leads to linear convergence for (Stochastic) Adagrad in the noiseless linear regression setting also considered in Xie et al. (2020). (a, b) Noiseless linear regression on 2000 examples in 20 dimensions. (c, d) Noiseless linear regression on 200 examples in 1000 dimensions.


α_u^{(t)} ≤ S uniformly for some S > 0. The base case holds by Lemma 2. Now assume α_u^{(i)} ≤ S for i ∈ {0, 1, …, t − 1}. Then, by Lemma 2, we have:

Σ_{i=0}^{t−1} ‖∇f(w^{(i)})‖^2 ≤ Σ_{i=0}^{t−1} 2L (f(w^{(i)}) − f(w^*)) ≤ 2L (f(w^{(0)}) − f(w^*)) · LS/(µ (α_l^{(0)})^2) < S

by assumption. Hence, by induction, α_u^{(t)} is bounded uniformly for all timesteps t.

(Figure 2 panel titles: "Convergence of Adagrad in Over-parameterized Neural Networks"; "1 Hidden Layer, Leaky ReLU Activation"; "1 Hidden Layer, x + sin(x) Activation".)
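The quantities (α_l^{(t)})^2 and (α_u^{(t)})^2 from the corollary can be observed directly in code. The following sketch runs full-batch Adagrad on a small interpolating linear-regression problem (the sizes and the fixed step size are illustrative assumptions, not the corollary's adaptive rate) and records max_i G^{(t)}_{i,i}, which stabilizes rather than diverging, in line with the induction above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Interpolating (noiseless) linear regression, as in the Adagrad corollaries.
n, dim = 20, 30
X = rng.normal(size=(n, dim))
y = X @ rng.normal(size=dim)

w = np.zeros(dim)
sq_grads = np.full(dim, 1e-8)    # running sum of squared gradients
alpha_u_sq = []                  # records (alpha_u^{(t)})^2 = max_i G^{(t)}_{ii}

eta = 0.5                        # illustrative fixed step, an assumption
for _ in range(10_000):
    grad = X.T @ (X @ w - y) / n
    sq_grads += grad ** 2
    G_diag = np.sqrt(sq_grads)   # diagonal of the preconditioner G^{(t)}
    alpha_u_sq.append(G_diag.max())
    w -= eta * grad / G_diag     # Adagrad step: preconditioned gradient update
loss = 0.5 * np.mean((X @ w - y) ** 2)
```

Because the gradients decay under interpolation, the accumulated sum of squared gradients converges, so the recorded maximum diagonal entry levels off instead of growing without bound.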

Figure 2: Using the adaptive rate provided by Corollary 1, with L approximated by L^{(t)} = 0.99 ‖∇f(w^{(t)})‖^2 / (2 f(w^{(t)})).

CODE AVAILABILITY

Code for all experiments is available at: https://anonymous.4open.science/r/cef30260-473d-4116-bda1-1debdcc4e00a/ 

APPENDIX

Hence, w^{(t)} converges linearly when: … To show that the left-hand side holds, we analyze when the discriminant is negative. Namely, the left-hand side holds if: … Since µ < L by Lemma 3, this is always true. The right-hand side holds when η < 4α_l^2/(5Lα_u), which holds by the assumption of the theorem, thereby completing the proof.

Note that if f is non-negative and µ-PL*, then we have: … Hence, we can use a fixed learning rate of η = min{…} in this setting.

E CONDITIONS FOR MONOTONICALLY DECREASING GRADIENTS

As discussed in the remarks after Theorem 2, we can provide a fixed learning rate for linear convergence provided that the gradients are monotonically decreasing. As we show below, this requires special conditions on the PL constant, µ, and the smoothness constant, L, of f.

Proposition 1. Suppose f : R^d → R is L-smooth and µ-PL, and φ : R^d → R^d is an infinitely differentiable, analytic function with analytic inverse φ^{-1}. If there exist α_l, α_u > 0 such that: … then generalized mirror descent converges linearly for any η < min{…}.

Proof. We follow exactly the proof of Theorem 2, except that at each timestep we need C < µ/L (which is less than 1 by Lemma 3) in order for the gradients to decrease monotonically, since: … Hence, in order for ‖∇f(w^{(t+1)})‖^2 < ‖∇f(w^{(t)})‖^2, we need C < µ/L. Thus, we select our learning rate such that: … Putting the bounds together, we obtain: … Now taking the expectation over i_t, we obtain … where the second inequality follows from Jensen's inequality and the third inequality follows from Lemma 2. Hence, we have: … Then taking the expectation with respect to i_t, i_{t−1}, …, i_1 yields … Hence, we can proceed inductively to conclude that … Thus, if 0 < 1 + C < 1, we establish linear convergence. The left-hand side is satisfied since µ < L, and the right-hand side is satisfied for η^{(t)} < 4µα_l^2/(5L^2 α_u), which holds by the theorem's assumption, thereby completing the proof.
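The monotone-gradient condition is easiest to see in the special case φ = id (plain gradient descent) on a rank-deficient quadratic, which is µ-PL* but not strongly convex: there ∇f(w^{(t+1)}) = (I − ηA)∇f(w^{(t)}), so the gradient norms decrease monotonically whenever η ≤ 2/L. The toy problem below is an illustrative assumption, not the proposition's general setting:

```python
import numpy as np

rng = np.random.default_rng(2)

# f(w) = 0.5 * w^T A w with A PSD and rank-deficient: f is PL* (mu = smallest
# nonzero eigenvalue of A) but not strongly convex.
B = rng.normal(size=(5, 8))
A = B.T @ B                              # rank 5 in dimension 8
L = np.linalg.eigvalsh(A).max()          # smoothness constant

w = rng.normal(size=8)
eta = 1.0 / L                            # fixed step below the 2/L threshold
grad_norms = []
for _ in range(50):
    grad = A @ w
    grad_norms.append(np.linalg.norm(grad))
    w -= eta * grad

# grad f(w_{t+1}) = (I - eta*A) grad f(w_t), and ||I - eta*A||_2 <= 1 when
# eta <= 2/L, so the gradient norms are non-increasing along the iterates.
monotone = all(b <= a + 1e-12 for a, b in zip(grad_norms, grad_norms[1:]))
```

For a general mirror φ the same monotonicity is exactly what the proposition's condition C < µ/L buys, at the price of the stated restriction on µ and L.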

G PROOF OF THEOREM 4

We restate the theorem below. The identity in (7) follows from the proof of f(w^{(t+1)}) ≤ (1 − κ^{−1}) f(w^{(t)}). Namely: … Hence we conclude that w^{(t+1)} ∈ B, and so the induction is complete.

In the case that φ^{(t)} is time-dependent, we establish a similar convergence result by assuming that … If α_u^{(t)} has a uniform upper bound and α_l^{(t)} has a uniform lower bound, then: … Hence we would conclude that φ^{(t)}(w^{(t+1)}) ∈ B(φ^{(0)}(w^{(0)}), R + δ).

H PROOF OF COROLLARY 1 AND COROLLARY 2

We repeat Corollary 1 below.

Corollary. Let f : R^d → R be an L-smooth function that is µ-PL. Let α … = 0; then Adagrad converges linearly for the adaptive step size …

Proof. By definition of G^{(t)}, we have that: (1) α … From the proof of Theorem 1, using learning rate … Although we have that (1 − κ^{(t)}) < 1 for all t, we need to ensure that Π_{i=0}^∞ (1 − κ^{(i)}) = 0 (otherwise we would not obtain convergence to a global minimum). Using the assumption that lim_{t→∞} α … Then, using the definition of the limit, for 0 < ε < c, there exists N such that for all t > N, |κ^{(t)} − c| < ε. Hence, letting

I PROOF OF COROLLARY 3

We present the corollary below.

Corollary 3. Suppose ψ is an α_l-strongly convex function and that ∇ψ is α_u-Lipschitz. Let D_ψ(x, y) = ψ(x) − ψ(y) − ∇ψ(y)^T (x − y) denote the Bregman divergence for x, y ∈ R^d. If f : R^d → R is non-negative, L-smooth, and µ-PL* on B = {x ; ∇ψ(x) ∈ B(∇ψ(w^{(0)}), R)} with …, then:

(1) There exists a global minimum w^{(∞)} ∈ B such that D_ψ(w^{(∞)}, w^{(0)}) ≤ R^2/(2α_l).
(2) Mirror descent with potential ψ converges linearly to w^{(∞)} for η = α_l/L.

Proof. The proof of existence and linear convergence follows immediately from Theorem 4. All that remains is to show that D_ψ(w^{(∞)}, w^{(0)}) ≤ R^2/(2α_l). As ψ is α_l-strongly convex, we have: … Hence D_ψ(w^*, w^{(0)}) < R^2/(2α_l) by definition. Then we have: …

J EXPERIMENTS ON OVER-PARAMETERIZED NEURAL NETWORKS

Below, we present experiments in which we apply the learning rate given by Corollary 1 to over-parameterized neural networks. Since the main difficulty is estimating the parameter L in neural networks, we instead use a crude approximation for L by setting L^{(t)} = 0.99 ‖∇f(w^{(t)})‖^2 / (2 f(w^{(t)})). The intuition for this approximation comes from Lemma 2. While there are no guarantees that this approximation yields linear convergence according to our theory, Figure 2 suggests empirically that it provides convergence.
Moreover, this approximation allows us to compute our adaptive learning rate in practice.
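As a minimal illustration of this heuristic (on least squares rather than a neural network, and with a conservative step of 1/(2L^{(t)}) instead of the corollary's full adaptive rate; both simplifications are assumptions made for this sketch):

```python
import numpy as np

rng = np.random.default_rng(3)

# Non-negative, L-smooth loss with an interpolating solution:
# f(w) = 0.5/n * ||Xw - y||^2. Sizes are illustrative assumptions.
n, dim = 20, 30
X = rng.normal(size=(n, dim))
y = X @ rng.normal(size=dim)

w = np.zeros(dim)
for _ in range(10_000):
    r = X @ w - y
    f = 0.5 * np.mean(r ** 2)
    if f < 1e-12:
        break
    grad = X.T @ r / n
    # Crude smoothness estimate motivated by Lemma 2 (||grad f||^2 <= 2 L f):
    L_t = 0.99 * (grad @ grad) / (2 * f)
    w -= grad / (2 * L_t)        # conservative step 1/(2 L^{(t)})
loss = 0.5 * np.mean((X @ w - y) ** 2)
```

Since ‖∇f‖^2 ≤ 2Lf for non-negative L-smooth f, the estimate L^{(t)} never exceeds 0.99 L, so the resulting step 1/(2L^{(t)}) = f/(0.99 ‖∇f‖^2) behaves like a slightly enlarged Polyak step and drives the loss toward zero on this problem.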

