CONVERGENCE ANALYSIS OF HOMOTOPY-SGD FOR NON-CONVEX OPTIMIZATION

Abstract

First-order stochastic methods for solving large-scale non-convex optimization problems are widely used in many big-data applications, e.g. training deep neural networks as well as other complex and potentially non-convex machine learning models. Their inexpensive iterations generally come at the cost of a slow global convergence rate (mostly sublinear), making it necessary to carry out a very high number of iterations before the iterates reach a neighborhood of a minimizer. In this work, we present a first-order stochastic algorithm based on a combination of homotopy methods and SGD, called Homotopy-Stochastic Gradient Descent (H-SGD), which has interesting connections with several heuristics proposed in the literature, e.g. optimization by Gaussian continuation, training by diffusion, and mollifying networks. Under some mild assumptions on the problem structure, we conduct a theoretical analysis of the proposed algorithm. Our analysis shows that, with a specifically designed scheme for the homotopy parameter, H-SGD enjoys a global linear rate of convergence to a neighborhood of a minimum while maintaining fast and inexpensive iterations. Experimental evaluations confirm the theoretical results and show that H-SGD can outperform standard SGD.

1. INTRODUCTION

This paper focuses on the theoretical development and analysis of a stochastic optimization algorithm, called Homotopy-Stochastic Gradient Descent (H-SGD), based on the combination of homotopy methods and stochastic gradient descent (SGD). The algorithm we propose is specifically designed to solve finite-sum problems of the form

$$w^{\star} \in \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \left\{ f(w) := \frac{1}{N} \sum_{j=1}^{N} f_j(w) \right\}, \qquad (1)$$

where $f : \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable, bounded below and not necessarily convex. In particular, we assume that we only have access to noisy function values and gradients of the objective function in equation 1 via a stochastic first-order oracle, as in (Nemirovski et al., 2009) and (Ghadimi & Lan, 2013). Problems of this form typically arise in machine learning and deep learning applications, where the dimensionality of the datasets makes full function and gradient evaluations too expensive. This class of problems is generally solved approximately by stochastic first-order iterative algorithms, e.g. SGD (Bottou et al., 2018), Adagrad (Duchi et al., 2011), Adam (Kingma & Ba, 2015). At iteration $t$, the algorithms of this class acquire a stochastic estimate $f(w_t, \xi_t)$ of the function value and $g(w_t, \xi_t)$ of the gradient by calling the oracle with input $w_t$, where $\xi_t$ is a random variable; e.g., when the noise comes from subsampling as in mini-batch scenarios, then $\xi_t \in \{0,1\}^N$ with $\|\xi_t\|_1 = M$ and $g(w_t, \xi_t) = \frac{1}{M} \sum_{j=1}^{N} \xi_{t,j} \, \nabla f_j(w_t)$. In the case of SGD, for a given $w_0 \in \mathbb{R}^d$ and $\alpha > 0$, the iterates are generated as follows

$$w_{t+1} := w_t - \alpha \, g(w_t, \xi_t). \qquad (2)$$

Consequently, the iterate $w_{t+1} = w_{t+1}(\xi_{[t]})$ is a function of the history $\xi_{[t]} := (\xi_0, \ldots, \xi_t)$ of the generated random process ($w_0$ should also be included in case it is a random initial point) and is hence itself random. In general, stochastic first-order methods enjoy fast convergence when the problem is characterized by a certain structure.
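The mini-batch oracle and the SGD update in equation 2 can be sketched as follows. This is a minimal illustration: the helper names `minibatch_grad` and `sgd` are ours, and the toy finite sum of one-dimensional quadratics is only for demonstration, not part of the paper's setting.

```python
import numpy as np

def minibatch_grad(grad_fns, w, batch_idx):
    """Stochastic oracle g(w, xi): average of the per-example gradients
    over the sampled mini-batch of size M (here, len(batch_idx))."""
    return sum(grad_fns[j](w) for j in batch_idx) / len(batch_idx)

def sgd(grad_fns, w0, alpha, n_iters, batch_size, seed=0):
    """Plain SGD: w_{t+1} = w_t - alpha * g(w_t, xi_t), with xi_t drawn
    by subsampling batch_size of the N component functions."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    N = len(grad_fns)
    for _ in range(n_iters):
        batch = rng.choice(N, size=batch_size, replace=False)  # realizes xi_t
        w = w - alpha * minibatch_grad(grad_fns, w, batch)
    return w
```

For instance, with $f_j(w) = (w - a_j)^2$ the minimizer of the average is the mean of the $a_j$, and the iterates settle in a neighborhood of it whose size depends on the step size and the gradient noise.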
In particular, when the Polyak-Łojasiewicz (PL) condition (see Karimi et al., 2016, for more details on the PL condition) holds for the objective function in Problem 1, then, with a "small enough" value for the step size, the SGD iterates converge linearly to a minimizer's neighborhood (Karimi et al., 2016; Vaswani et al., 2019). Unfortunately, in many machine learning applications, the PL condition is not a realistic assumption, as the landscape is generally characterized by the presence of multiple local minima and saddle points (Dauphin et al., 2014; Kawaguchi, 2016; Karimi et al., 2017). At the same time, in the vicinity of the minimizers the problems generally exhibit stronger structures, e.g. PL or even strong convexity, allowing for faster local convergence (Karimi et al., 2017). In such a scenario, a smart initialization hence becomes crucial for the numerical performance of the method (Sutskever et al., 2013b). Unfortunately, the power of the existing smart-initialization heuristics is often quite limited, given the limited knowledge of the problem's landscape that is generally available. In addition, these heuristics typically cannot guarantee that the SGD iterates start "close enough" to a minimizer, i.e. in the region where the PL condition holds, such that the method enjoys a linear rate of convergence. The ideal scenario would therefore be to exploit the stronger local structure while the method's iterates gradually approach a minimizer, independently of the starting point. In this regard, homotopy methods are a general strategy for tackling difficult optimization problems by gradually transforming a simplified version of a target problem, or a version with a known minimizer, back to its original form while following a solution along the way.
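To illustrate the role of the PL condition, $f(w) = w^2 + 3\sin^2(w)$ is a standard example of a non-convex function satisfying it (it appears in Karimi et al., 2016). The sketch below uses exact gradients for simplicity, so it shows the deterministic analogue of the linear rate; the step size $\alpha = 1/L$ with $L = 8$ follows from $|f''(w)| \le 2 + 6 = 8$.

```python
import numpy as np

# f(w) = w^2 + 3 sin^2(w): non-convex, but PL, so gradient descent
# contracts the optimality gap f(w_t) - f* geometrically.
f = lambda w: w**2 + 3.0 * np.sin(w)**2
grad = lambda w: 2.0 * w + 3.0 * np.sin(2.0 * w)   # d/dw of 3 sin^2(w) = 3 sin(2w)

w, alpha = 3.0, 1.0 / 8.0          # alpha = 1/L with L = 8
gaps = []
for _ in range(50):
    gaps.append(f(w))              # f* = 0 at the global minimizer w = 0
    w -= alpha * grad(w)
# gaps shrinks geometrically toward 0
```

Under the stochastic oracle of the introduction the same contraction holds in expectation, up to a noise-dependent neighborhood of the minimizer.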
Consequently, they preserve at each step the vicinity to a minimizer of the currently tackled problem, allowing the solver to always work in regions where the problems exhibit stronger structures. In general terms, homotopy methods (Allgower & Georg, 2003) are a widely and successfully used mathematical tool to efficiently solve various problems in numerical analysis, e.g. (Deuflhard, 2011), (Liao, 2012). Such methods are also suitable for solving complex non-convex optimization problems where no or only little prior knowledge regarding the localization of the solutions is available, allowing for the exploitation of the stronger local structures of the problems in order to achieve fast global convergence, e.g. (Xiao & Zhang, 2012; Lin & Xiao, 2014; Suzumura et al., 2014; Gargiani et al., 2020). In this work, we propose a stochastic first-order numerical method to solve Problem 1, called Homotopy-Stochastic Gradient Descent (H-SGD), which is based on the combination of the homotopy method and SGD. After introducing the method and discussing the related work (Section 2), our contributions are as follows:

1. In Section 3, we provide a general theoretical analysis of H-SGD under some mild assumptions, showing that, if the increments in the homotopy parameter are "small enough", the proposed method tracks in expectation an r-optimal solution across homotopy iterations. We then show that, in the same setting, H-SGD can achieve a global linear rate of convergence to a minimizer's neighborhood when used in combination with a specific schedule for the homotopy parameter, i.e. $\Delta\lambda_i$ decreases exponentially across homotopy iterations.

2. In Section 4, we empirically evaluate the performance of H-SGD. Our experiments not only confirm the theoretical results derived in Section 3 but also show that H-SGD with a smartly designed homotopy map can outperform SGD.
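As a rough illustration of the first contribution, a schedule with exponentially decreasing increments $\Delta\lambda_i$ can be sketched as follows. The helper `h_sgd`, the decay parameters `delta0` and `rho`, and the convex-combination homotopy map used in the usage example are illustrative assumptions of ours, not the exact scheme analyzed in Section 3.

```python
import numpy as np

def h_sgd(grad_map, w0, alpha, n_homotopy, inner_iters,
          delta0=0.5, rho=0.5, seed=0):
    """Minimal H-SGD sketch: at each homotopy level lambda_i, run a few
    SGD steps on f(., lambda_i), warm-starting from the previous
    approximate solution.  The increments Delta-lambda_i = delta0 * rho**i
    decay exponentially, and lambda is capped at 1 (the target problem)."""
    rng = np.random.default_rng(seed)
    w, lam = np.asarray(w0, dtype=float), 0.0
    for i in range(n_homotopy):
        lam = min(1.0, lam + delta0 * rho**i)      # exponentially decaying steps
        for _ in range(inner_iters):
            w = w - alpha * grad_map(w, lam, rng)  # stochastic grad of f(., lam)
    return w
```

A toy homotopy map in the spirit of Section 2 is the convex combination $f(w, \lambda) = \lambda f(w) + (1 - \lambda)\,\|w\|^2$, whose $\lambda = 0$ problem is convex with a known minimizer at the origin; as $\lambda \to 1$ the tracked solution drifts toward a minimizer of the target $f$.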

2. HOMOTOPY-SGD

Homotopy-Stochastic Gradient Descent (H-SGD) is based on the combination of the homotopy method and SGD, in the hope of combining the best of both worlds. In particular, the goal is to maintain the advantageous properties of SGD, such as its cheap iterations and fast local convergence under the PL condition, while maximally exploiting the stronger local structures via the homotopy scheme. H-SGD therefore relies on the definition of a homotopy map $f(w, \lambda) : \mathbb{R}^d \times [0, 1] \to \mathbb{R}$ such that, when $\lambda = 0$, we recover a well-behaved function, e.g. a convex one, or a function with a known minimizer's localization, and, by increasing the parameter $\lambda$, also called the homotopy parameter, we gradually morph it so as to finally end up with our target objective function $f(w, 1) = f(w)$ (see Suciu, 2016, for more details on homotopy functions). By using such a homotopy map, H-SGD finds an approximate solution of Problem 1 by approximately solving a series of parametric problems that gradually leads to the target one. In particular, in each homotopy

