STOCHASTIC PROXIMAL POINT ALGORITHM FOR LARGE-SCALE NONCONVEX OPTIMIZATION: CONVERGENCE, IMPLEMENTATION, AND APPLICATION TO NEURAL NETWORKS

Abstract

We revisit the stochastic proximal point algorithm (SPPA) for large-scale nonconvex optimization problems. For convex problems, SPPA has been shown to converge faster and more stably than the celebrated stochastic gradient descent (SGD) algorithm and its many variants. However, the per-iteration update of SPPA is defined abstractly and has long been considered expensive. In this paper, we show that SPPA can be implemented efficiently. If the problem is a nonlinear least squares, each iteration of SPPA can be carried out by Gauss-Newton; with a linear algebra trick the resulting complexity is of the same order as that of SGD. For more generic problems, SPPA can still be implemented efficiently with L-BFGS or accelerated gradient methods. Another contribution of this work is a proof that SPPA converges to a stationary point in expectation for nonconvex problems. The result is encouraging in that it admits more flexible choices of step sizes under assumptions similar to those in prior work. The proposed algorithm is elaborated for both regression and classification problems using different neural network structures. Real-data experiments showcase its effectiveness in terms of convergence and accuracy compared to SGD and its variants.

1. INTRODUCTION

Algorithm design for large-scale machine learning problems has been dominated by stochastic (sub)gradient descent (SGD) and its variants (Bottou et al., 2018). The main reasons are two-fold: on the one hand, the data set may be so large that obtaining the full gradient is too costly; on the other hand, solving the formulated problem to very high accuracy is typically unnecessary in machine learning, since the ultimate goal of most tasks is not to fit the training data but to generalize well to unseen data. As a result, stochastic algorithms such as SGD have gained tremendous popularity. There have been many variations and extensions of the plain vanilla SGD algorithm that aim to accelerate its convergence. One line of research focuses on reducing the variance of the stochastic gradient, resulting in well-known algorithms such as SVRG (Johnson and Zhang, 2013) and SAGA (Defazio et al., 2014), at the cost of significant extra time/memory complexity. More recently, adaptive learning-rate schemes such as AdaGrad (Duchi et al., 2011) and Adam (Kingma and Ba, 2014) have proven effective while keeping the algorithm fully stochastic and light-weight. In terms of theory, there has also been a surge of work quantifying the best possible rates attainable using first-order information (Lei et al., 2017; Allen-Zhu, 2017; 2018a; b), as well as the ability to obtain not only stationary points but also local optima (Ge et al., 2015; Jin et al., 2017; Xu et al., 2018; Allen-Zhu, 2018).

1.1. STOCHASTIC PROXIMAL POINT ALGORITHM (SPPA)

In this work, we consider a different type of stochastic algorithm called the stochastic proximal point algorithm (SPPA), also known as the incremental proximal point method (Bertsekas, 2011a; b) or stochastic proximal iterations (Ryu and Boyd, 2014). Consider the following optimization problem with an objective in the form of a finite sum of component functions

$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad f(x) = \frac{1}{n} \sum_{i=1}^{n} \ell_i(x).$  (1)

SPPA takes the following simple form:

1: repeat
2:   randomly draw $i_k$ from $\{1, \ldots, n\}$
3:   $x_{k+1} \leftarrow \arg\min_x \, \ell_{i_k}(x) + (1/2\eta_k) \|x - x_k\|^2 = \mathrm{Prox}_{\eta_k \ell_{i_k}}(x_k)$
4: until convergence

The mapping in line 3 is the proximal operator of the function $\eta_k \ell_{i_k}$ evaluated at $x_k$. This is the stochastic version of the proximal point algorithm, which dates back to Rockafellar (1976). Admittedly, SPPA is not as universally applicable as SGD, due to the abstraction of the per-iteration update rule. It also asks for more information from the problem than merely the first-order derivatives. However, with this additional information there is also hope for faster and more robust convergence guarantees. As we will see in the numerical experiments, SPPA achieves good optimization performance with fewer passes through the data set, although each batch requires a little more computation. We believe that in many cases it is worth trading more computation for fewer memory accesses. To the best of our knowledge, convergence analyses of SPPA have only been carried out for convex problems (Bertsekas, 2011a; Ryu and Boyd, 2014; Bianchi, 2016). These studies show that SPPA converges at a rate similar to SGD's for convex problems, but its updates are much more robust to instabilities in the problem.
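As a concrete illustration of line 3, consider the simplest case where each component is a linear least-squares term, $\ell_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$. The proximal subproblem then has a rank-one closed-form solution, so each SPPA update costs the same order as an SGD step. The sketch below (function and variable names are ours, not from the paper) runs this update on a synthetic consistent system:

```python
import numpy as np

def sppa_linear_ls(A, b, eta=1.0, n_epochs=20, seed=0):
    """SPPA on f(x) = (1/2n) * sum_i (a_i^T x - b_i)^2.

    For one squared-error component, the proximal subproblem
        argmin_x 0.5*(a^T x - b_i)^2 + (1/(2*eta))*||x - x_k||^2
    has the closed form (a rank-one / Sherman-Morrison step):
        x_{k+1} = x_k - eta*(a^T x_k - b_i)/(1 + eta*||a||^2) * a
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):      # one pass over the data
            a, bi = A[i], b[i]
            resid = a @ x - bi
            x = x - eta * resid / (1.0 + eta * (a @ a)) * a
    return x

# synthetic noiseless problem: SPPA recovers the planted solution
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
x_hat = sppa_linear_ls(A, b)
```

Note that, unlike an SGD step, the effective step size is automatically damped by the factor $1/(1 + \eta\|a\|^2)$, which is one source of SPPA's robustness to the choice of $\eta$.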
Most authors also accept the premise that the proximal operator is sometimes difficult to evaluate, and have thus proposed variations of the plain vanilla version to handle more complicated problem structures (Wang and Bertsekas, 2013; Duchi and Ruan, 2018; Asi and Duchi, 2019b; Davis and Drusvyatskiy, 2019). For nonconvex optimization problems, there was very little work until very recently (Davis and Drusvyatskiy, 2019; Asi and Duchi, 2019a). However, their convergence analysis is somewhat unconventional. Typically for a nonconvex problem, we would expect a theoretical claim that the iterates generated by SPPA converge (in expectation) to a stationary point. This is not easy, and the results of Davis and Drusvyatskiy (2019) and Asi and Duchi (2019a) instead define an imaginary sequence (one that is never computed in practice) $\{\hat{x}_k\}$ as

$\hat{x}_k = \arg\min_x \, f(x) + (1/2\lambda) \|x - x_k\|^2,$

i.e., the proximal operator of the full loss function applied to the algorithm iterates $\{x_k\}$. Their results show that this imaginary sequence $\{\hat{x}_k\}$ converges to a stationary point in expectation.
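For context, this style of guarantee is usually phrased through the Moreau envelope: for weakly convex $f$ and suitably small $\lambda$, a standard identity relates the imaginary point to the envelope's gradient,

```latex
f_\lambda(x) \;=\; \min_y \; f(y) + \tfrac{1}{2\lambda}\|y - x\|^2,
\qquad
\nabla f_\lambda(x_k) \;=\; \tfrac{1}{\lambda}\,(x_k - \hat{x}_k),
```

so driving $\|x_k - \hat{x}_k\|$ to zero certifies that $x_k$ is nearly stationary for $f$, even though $\hat{x}_k$ itself is never computed.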

1.2. CONTRIBUTIONS

We present two main contributions in this paper: efficient implementations of the proximal operator update for SPPA, with applications to regression and classification problems, and a proof that SPPA converges to a stationary point for general nonconvex problems. The cost of the abstract per-iteration update rule has long been a burden for SPPA, even though the algorithm has been shown to converge faster and more stably than the celebrated SGD algorithm. In this paper, we show that it is in fact not a burden when the per-iteration update is implemented efficiently. In the implementation section we discuss how the abstract per-iteration update rule can be realized for nonlinear least squares (NLS) problems and for other nonlinear problems. We present one implementation for each of these two categories: SPPA-Gauss-Newton (SPPA-GN) for regression with nonlinear least squares, and SPPA-Accelerated for classification problems. We apply SPPA to a large family of nonconvex optimization problems and show that the seemingly complicated proximal operator update can still be obtained efficiently. Both implementations give results that are comparable with state-of-the-art stochastic algorithms. On the other hand, there is a large group of nonlinear problems that are not necessarily expressed as an NLS problem (non-NLS). Many present-day classification problems in deep learning use loss functions other than the mean squared error. Hence, we suggest an
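To preview the NLS case, the SPPA subproblem for a mini-batch residual $r(x) \in \mathbb{R}^b$ can be attacked with damped Gauss-Newton steps; since each step only requires solving a $b \times b$ linear system (via the Woodbury identity), the per-iteration cost stays close to SGD's when $b \ll d$. Below is a minimal sketch under our own assumptions (the helper names `residual_fn`/`jac_fn` and the fixed inner iteration count are ours), not the paper's exact SPPA-GN routine:

```python
import numpy as np

def sppa_gn_prox(residual_fn, jac_fn, x_k, eta, inner_iters=3):
    """Approximate the SPPA proximal step for a nonlinear least-squares
    component via damped Gauss-Newton.

    Subproblem: min_x 0.5*||r(x)||^2 + (1/(2*eta))*||x - x_k||^2,
    where residual_fn(x) returns r(x) in R^b and jac_fn(x) its Jacobian
    J(x) in R^{b x d}.  Each step uses the Woodbury identity
        (J^T J + lam*I)^{-1} = (1/lam)*(I - J^T (J J^T + lam*I)^{-1} J),
    so only a b x b system is solved (b = batch size << d).
    """
    lam = 1.0 / eta
    x = x_k.copy()
    for _ in range(inner_iters):
        r, J = residual_fn(x), jac_fn(x)
        g = J.T @ r + lam * (x - x_k)               # subproblem gradient
        small = J @ J.T + lam * np.eye(J.shape[0])  # b x b system matrix
        w = np.linalg.solve(small, J @ g)
        x = x - (g - J.T @ w) / lam                 # damped Gauss-Newton step
    return x
```

As a sanity check, when the residual is linear, $r(x) = Ax - b$, a single Gauss-Newton step recovers the exact proximal point $(A^\top A + \lambda I)^{-1}(\lambda x_k + A^\top b)$, and further inner iterations leave it unchanged.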

