LINEAR CONVERGENCE AND IMPLICIT REGULARIZATION OF GENERALIZED MIRROR DESCENT WITH TIME-DEPENDENT MIRRORS

Anonymous

Abstract

The following questions are fundamental to understanding the properties of overparameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum? (2) What form of implicit regularization occurs through training? While significant progress has been made in answering both of these questions for gradient descent, they have yet to be answered more completely for general optimization methods. In this work, we establish sufficient conditions for linear convergence and obtain approximate implicit regularization results for generalized mirror descent (GMD), a generalization of mirror descent with a possibly time-dependent mirror. GMD subsumes popular first-order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad. By using the Polyak-Łojasiewicz inequality, we first present a simple analysis under which non-stochastic GMD converges linearly to a global minimum. We then present a novel, Taylor-series-based analysis to establish sufficient conditions for linear convergence of stochastic GMD. As a corollary, our result establishes sufficient conditions and provides learning rates for linear convergence of stochastic mirror descent and Adagrad. Lastly, we obtain approximate implicit regularization results for GMD by proving that GMD converges to an interpolating solution that is approximately the closest interpolating solution to the initialization in ℓ2-norm in the dual space.

1. INTRODUCTION

Recent work has established the optimization and generalization benefits of over-parameterization in machine learning (Belkin et al., 2019; Liu et al., 2020; Zhang et al., 2017). In particular, several works including Vaswani et al. (2019); Du et al. (2018); Liu et al. (2020); Li & Liang (2018) have demonstrated that over-parameterized models converge to a global minimum when trained using stochastic gradient descent and that such convergence can occur at a linear rate. Independently, other work, such as Gunasekar et al. (2018), has characterized implicit regularization of over-parameterized models, i.e., the properties of the solution selected by a given optimization method, without proving convergence. Recently, Azizan & Hassibi (2019); Azizan et al. (2019) simultaneously proved convergence and analyzed approximate implicit regularization for mirror descent (Beck & Teboulle, 2003; Nemirovsky & Yudin, 1983). In particular, by using the fundamental identity of stochastic mirror descent (SMD), they proved that SMD converges to an interpolating solution that is approximately the closest one to the initialization in Bregman divergence. However, these works do not provide a rate of convergence for SMD and assume that there exists an interpolating solution within a bounded Bregman divergence of the initialization. In this work, we provide sufficient conditions for linear convergence and obtain approximate implicit regularization results for generalized mirror descent (GMD), an extension of mirror descent that introduces (1) a potential-free update rule and (2) a time-dependent mirror; namely, GMD with invertible φ^{(t)} : R^d → R^d and learning rate η is used to minimize a real-valued loss function f according to the update rule:

φ^{(t)}(w^{(t+1)}) = φ^{(t)}(w^{(t)}) − η ∇f(w^{(t)}).   (1)
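To make update rule (1) concrete, the sketch below implements one GMD step in NumPy and runs it on a toy quadratic loss with two illustrative mirrors. The mirror maps here are time-independent for simplicity (a time-dependent mirror would simply pass a different φ at each step), and the specific choices of loss, mirror maps, step sizes, and iteration counts are our own assumptions for illustration, not taken from the paper.

```python
import numpy as np

def gmd_step(w, grad_f, phi, phi_inv, eta):
    """One generalized mirror descent step, following update rule (1):
    phi(w_next) = phi(w) - eta * grad_f(w), so w_next = phi_inv(...)."""
    return phi_inv(phi(w) - eta * grad_f(w))

# Toy loss f(w) = 0.5 * ||w - w_star||^2, with a unique interpolating solution.
w_star = np.array([1.0, -2.0])
grad_f = lambda w: w - w_star

# phi(w) = w recovers plain gradient descent.
identity = (lambda w: w, lambda z: z)
# phi(w) = w^3 (applied coordinate-wise) is an invertible nonlinear mirror;
# this particular choice is hypothetical, chosen only to show a non-identity phi.
cubic = (lambda w: w**3, lambda z: np.sign(z) * np.abs(z) ** (1 / 3))

w_gd = np.zeros(2)
for _ in range(300):
    w_gd = gmd_step(w_gd, grad_f, *identity, eta=0.1)

w_mirror = np.zeros(2)
for _ in range(2000):
    w_mirror = gmd_step(w_mirror, grad_f, *cubic, eta=0.3)

print(w_gd, w_mirror)  # both iterates approach w_star = [1, -2]
```

Note that both mirrors drive the iterates to the same minimizer here only because the problem has a unique interpolating solution; in the over-parameterized setting the paper studies, different mirrors select different interpolating solutions, which is exactly the implicit regularization question.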

