STOCHASTIC PROXIMAL POINT ALGORITHM FOR LARGE-SCALE NONCONVEX OPTIMIZATION: CONVERGENCE, IMPLEMENTATION, AND APPLICATION TO NEURAL NETWORKS

Abstract

We revisit the stochastic proximal point algorithm (SPPA) for large-scale nonconvex optimization problems. For convex problems, SPPA has been shown to converge faster and more stably than the celebrated stochastic gradient descent (SGD) algorithm and its many variants. However, the per-iteration update of SPPA is defined abstractly and has long been considered expensive. In this paper, we show that SPPA can be implemented efficiently. If the problem is a nonlinear least-squares problem, each iteration of SPPA can be carried out by Gauss-Newton; with a linear algebra trick, the resulting per-iteration complexity is of the same order as that of SGD. For more generic problems, SPPA can still be implemented efficiently with L-BFGS or accelerated gradient methods. Another contribution of this work is a proof that SPPA converges to a stationary point in expectation for nonconvex problems. Encouragingly, the result admits more flexible choices of step sizes under assumptions similar to those for SGD. The proposed algorithm is elaborated for both regression and classification problems using different neural network structures. Real-data experiments showcase its effectiveness in terms of convergence and accuracy compared to SGD and its variants.

1. INTRODUCTION

Algorithm design for large-scale machine learning problems has been dominated by stochastic (sub)gradient descent (SGD) and its variants (Bottou et al., 2018). The main reasons are two-fold: on the one hand, the data set may be so large that obtaining the full gradient is too costly; on the other hand, solving the formulated problem to very high accuracy is typically unnecessary in machine learning, since the ultimate goal of most tasks is not to fit the training data but to generalize well to unseen data. As a result, stochastic algorithms such as SGD have gained tremendous popularity. There have been many variations and extensions of the plain vanilla SGD algorithm that aim to accelerate its convergence. One line of research focuses on reducing the variance of the stochastic gradient, resulting in well-known algorithms such as SVRG (Johnson and Zhang, 2013) and SAGA (Defazio et al., 2014), at the cost of significant extra time/memory complexity. More recently, adaptive learning-rate schemes such as AdaGrad (Duchi et al., 2011) and Adam (Kingma and Ba, 2014) have proven effective while keeping the algorithm fully stochastic and lightweight. In terms of theory, there has also been a surge of work quantifying the best possible rates achievable using first-order information (Lei et al., 2017; Allen-Zhu, 2017; 2018a;b), as well as the ability of such methods to obtain not only stationary points but also local optima (Ge et al., 2015; Jin et al., 2017; Xu et al., 2018; Allen-Zhu, 2018).

1.1. STOCHASTIC PROXIMAL POINT ALGORITHM (SPPA)

In this work, we consider a different type of stochastic algorithm called the stochastic proximal point algorithm (SPPA), also known as the incremental proximal point method (Bertsekas, 2011a;b) or stochastic proximal iterations (Ryu and Boyd, 2014). Consider the following optimization problem


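As a concrete illustration of the SPPA update, and of the kind of linear algebra trick the abstract alludes to, consider a linear least-squares component f_i(x) = (1/2)(a_i^T x - b_i)^2. The proximal subproblem argmin_x f_i(x) + (1/(2*gamma))||x - x_k||^2 then has a closed form obtained via the Sherman-Morrison identity. The sketch below is purely illustrative (the function names and the toy data are ours, not the paper's); the nonlinear and neural-network cases discussed later require an inner solver such as Gauss-Newton instead of a closed form.

```python
# Minimal SPPA sketch on a linear least-squares problem (illustrative only).
# For f_i(x) = 0.5*(a_i^T x - b_i)^2, the proximal step
#     x_{k+1} = argmin_x f_i(x) + (1/(2*gamma)) * ||x - x_k||^2
# solves (a_i a_i^T + I/gamma) x = a_i b_i + x_k/gamma, which by
# Sherman-Morrison reduces to a rank-one update:
#     x_{k+1} = x_k - gamma*(a_i^T x_k - b_i)/(1 + gamma*||a_i||^2) * a_i
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sppa_step(x, a, b, gamma):
    """One exact proximal step on the sampled component f_i."""
    scale = gamma * (dot(a, x) - b) / (1.0 + gamma * dot(a, a))
    return [xj - scale * aj for xj, aj in zip(x, a)]

def sppa(A, b, gamma=1.0, iters=2000, seed=0):
    """Run SPPA with uniformly sampled components and a constant step size."""
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        i = rng.randrange(len(A))
        x = sppa_step(x, A[i], b[i], gamma)
    return x

# Consistent toy system with solution x_star = (1, -2).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
x_star = [1.0, -2.0]
b = [dot(a, x_star) for a in A]
x = sppa(A, b)
```

Note that each step costs O(d), the same order as an SGD step on f_i, which is the point of the complexity claim in the abstract; unlike SGD, however, the update is implicit (it evaluates the residual at the new iterate), which is the source of SPPA's stability.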