LINEAR CONVERGENCE AND IMPLICIT REGULARIZATION OF GENERALIZED MIRROR DESCENT WITH TIME-DEPENDENT MIRRORS

Anonymous

Abstract

The following questions are fundamental to understanding the properties of over-parameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum? (2) What form of implicit regularization occurs through training? While significant progress has been made in answering both of these questions for gradient descent, they have yet to be answered more completely for general optimization methods. In this work, we establish sufficient conditions for linear convergence and obtain approximate implicit regularization results for generalized mirror descent (GMD), a generalization of mirror descent with a possibly time-dependent mirror. GMD subsumes popular first-order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad. Using the Polyak-Lojasiewicz (PL) inequality, we first present a simple analysis under which non-stochastic GMD converges linearly to a global minimum. We then present a novel Taylor-series-based analysis to establish sufficient conditions for linear convergence of stochastic GMD. As a corollary, our result establishes sufficient conditions and provides learning rates for linear convergence of stochastic mirror descent and Adagrad. Lastly, we obtain approximate implicit regularization results for GMD by proving that GMD converges to an interpolating solution that is approximately the closest interpolating solution to the initialization in ℓ2-norm in the dual space. In comparison, Xie et al. (2020) recently established linear convergence for a norm version of Adagrad (Adagrad-Norm) using the PL inequality, while Wu et al. (2019) established linear convergence of Adagrad-Norm in the particular setting of over-parameterized neural networks with one hidden layer; an alternate analysis of Adagrad-Norm for smooth, non-convex functions in Ward et al. (2019) yields only a sub-linear convergence rate.

1. INTRODUCTION

Recent work has established the optimization and generalization benefits of over-parameterization in machine learning (Belkin et al., 2019; Liu et al., 2020; Zhang et al., 2017). In particular, several works, including Vaswani et al. (2019); Du et al. (2018); Liu et al. (2020); Li & Liang (2018), have demonstrated that over-parameterized models converge to a global minimum when trained using stochastic gradient descent and that such convergence can occur at a linear rate. Independently, other works, such as Gunasekar et al. (2018), have characterized the implicit regularization of over-parameterized models, i.e., the properties of the solution selected by a given optimization method, without proving convergence. Recently, Azizan & Hassibi (2019); Azizan et al. (2019) simultaneously proved convergence and analyzed approximate implicit regularization for mirror descent (Beck & Teboulle, 2003; Nemirovsky & Yudin, 1983). In particular, by using the fundamental identity of stochastic mirror descent (SMD), they proved that SMD converges to an interpolating solution that is approximately the closest one to the initialization in Bregman divergence. However, these works do not provide a rate of convergence for SMD and assume that there exists an interpolating solution within ε in Bregman divergence from the initialization.

In this work, we provide sufficient conditions for linear convergence and obtain approximate implicit regularization results for generalized mirror descent (GMD), an extension of mirror descent that introduces (1) a potential-free update rule and (2) a time-dependent mirror; namely, GMD with invertible mirrors φ^(t) : R^d → R^d and learning rate η is used to minimize a real-valued loss function f according to the update rule

    φ^(t)(w^(t+1)) = φ^(t)(w^(t)) − η∇f(w^(t)).    (1)

We discuss the stochastic version of GMD (SGMD) in Section 3. GMD generalizes both mirror descent and preconditioning methods.
Namely, if for all t, φ^(t) = ∇ψ for some strictly convex function ψ, then GMD corresponds to mirror descent with potential ψ; if φ^(t) = G^(t) for some invertible matrix G^(t) ∈ R^{d×d}, then the update rule in equation (1) reduces to w^(t+1) = w^(t) − η(G^(t))^{-1}∇f(w^(t)) and hence represents applying a pre-conditioner to gradient updates. The following is a summary of our results:

1. We provide a simple proof of linear convergence of GMD under the Polyak-Lojasiewicz (PL) inequality (Theorem 1).

2. We provide sufficient conditions under which SGMD converges linearly under an adaptive learning rate (Theorems 2 and 3)¹.

3. As corollaries to Theorems 1 and 3, in Section 5 we provide sufficient conditions for linear convergence of stochastic mirror descent as well as of stochastic preconditioner methods such as Adagrad (Duchi et al., 2011).

4. We prove the existence of an interpolating solution and linear convergence of GMD to this solution for non-negative loss functions that locally satisfy the PL* inequality (Liu et al., 2020). This result (Theorem 4) provides approximate implicit regularization results for GMD: GMD converges linearly to an interpolating solution that is approximately the closest interpolating solution to the initialization in ℓ2-norm in the dual space induced by φ^(t).
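As a concrete illustration of update rule (1) and the two special cases above, the following is a minimal NumPy sketch; the particular mirrors, step size, and vectors are illustrative choices, not taken from the paper:

```python
import numpy as np

def gmd_step(w, grad, phi, phi_inv, eta):
    """One generalized mirror descent step (update rule (1)):
    phi(w_next) = phi(w) - eta * grad  =>  w_next = phi^{-1}(phi(w) - eta * grad)."""
    return phi_inv(phi(w) - eta * grad)

eta = 0.1
w = np.array([1.0, 2.0, 3.0])
grad = np.array([0.5, -1.0, 2.0])   # stand-in for the gradient of some loss f at w

# Special case 1: linear mirror phi(w) = G w for an invertible matrix G.
# The GMD step then equals the preconditioned update w - eta * G^{-1} grad
# (e.g. an Adagrad-style diagonal preconditioner).
G = np.diag([1.0, 2.0, 4.0])
w_pre = gmd_step(w, grad, lambda v: G @ v, lambda v: np.linalg.solve(G, v), eta)
assert np.allclose(w_pre, w - eta * np.linalg.solve(G, grad))

# Special case 2: phi = grad psi for the strictly convex potential
# psi(w) = sum_i (w_i log w_i - w_i) on w > 0, so phi = log and phi^{-1} = exp.
# This recovers entropic mirror descent (exponentiated gradient): the
# multiplicative update w * exp(-eta * grad).
w_md = gmd_step(w, grad, np.log, np.exp, eta)
assert np.allclose(w_md, w * np.exp(-eta * grad))
```

With the identity mirror φ^(t)(w) = w, the same step reduces to plain gradient descent, so a single routine covers all three methods.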



¹ We also provide a fixed learning rate for monotonically decreasing gradients ∇f(w^(t)).



2. RELATED WORK

Recent work (Azizan et al., 2019) established convergence of stochastic mirror descent (SMD) for nonlinear optimization problems. It characterized the implicit bias of mirror descent by demonstrating that SMD converges to a global minimum that is within ε of the closest interpolating solution in Bregman divergence. The analysis in Azizan et al. (2019) relies on the fundamental identity of SMD and does not provide explicit learning rates or establish a rate of convergence for SMD in the nonlinear setting. The work in Azizan & Hassibi (2019) provided explicit learning rates for the convergence of SMD in the linear setting under a strongly convex potential, again without a rate of convergence. While these works established convergence of SMD, prior work by Gunasekar et al. (2018) analyzed the implicit bias of SMD without proving convergence. A potential-based version of generalized mirror descent with time-varying regularizers was presented for online problems in Orabona et al. (2015). That work is primarily concerned with establishing regret bounds for the online learning setting, which differs from our setting of minimizing a loss function given a set of known data points. A potential-free formulation of GMD in the continuous-time (flow) setting was presented in Gunasekar et al. (2020).

The Polyak-Lojasiewicz (PL) inequality (Lojasiewicz, 1963; Polyak, 1963) serves as a simple condition for linear convergence in non-convex optimization problems and is satisfied in a number of settings including over-parameterized neural networks (Liu et al., 2020). Work by Karimi et al. (2016) demonstrated linear convergence of a number of descent methods (including gradient descent) under the PL inequality. Similarly, Vaswani et al. (2019) proved linear convergence of stochastic gradient descent (SGD) under the PL inequality and the strong growth condition (SGC), and Bassily et al. (2018) established the same rate for SGD under just the PL inequality. Soltanolkotabi et al. (2019) also used the PL inequality to establish a local linear convergence result for gradient descent on one-hidden-layer over-parameterized neural networks.
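To see numerically why the PL inequality yields linear convergence even without convexity, consider the following small gradient descent experiment on a well-known non-convex function that satisfies the PL inequality; the start point, step size, and iteration count are illustrative choices:

```python
import math

# f(x) = x^2 + 3*sin(x)^2 is non-convex, yet it satisfies the PL inequality
# (1/2) * f'(x)^2 >= mu * (f(x) - f*), with global minimum f* = 0 at x = 0.
# Under PL plus smoothness, gradient descent with a small enough constant
# step size drives the loss to the global minimum at a linear (geometric) rate.
f = lambda x: x ** 2 + 3 * math.sin(x) ** 2
df = lambda x: 2 * x + 3 * math.sin(2 * x)   # f'(x)

x, eta = 2.0, 0.05                           # illustrative start point and step size
losses = [f(x)]
for _ in range(300):
    x = x - eta * df(x)                      # plain gradient descent step
    losses.append(f(x))

# The loss decreases monotonically and reaches the global minimum numerically.
assert all(b <= a for a, b in zip(losses, losses[1:]))
assert losses[-1] < 1e-8
```

Here f''(x) = 2 + 6 cos(2x) is bounded by 8, so any step size below 1/8 guarantees monotone descent, and the PL condition rules out getting stuck away from the global minimum.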

