MINIBATCH STOCHASTIC THREE POINTS METHOD FOR UNCONSTRAINED SMOOTH MINIMIZATION

Abstract

In this paper, we propose a new zero-order optimization method called the minibatch stochastic three points (MiSTP) method to solve an unconstrained minimization problem in a setting where only an approximation of the objective function can be evaluated. It is based on the recently proposed stochastic three points (STP) method (Bergou et al., 2020). At each iteration, MiSTP generates a random search direction in a manner similar to STP, but selects the next iterate based solely on an approximation of the objective function rather than its exact evaluation. We also analyze our method's complexity in the nonconvex and convex cases and evaluate its performance on multiple machine learning tasks.

1. INTRODUCTION

In this paper we consider the following unconstrained finite-sum optimization problem:
$$\min_{x \in \mathbb{R}^d} f(x) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$
where each $f_i : \mathbb{R}^d \to \mathbb{R}$ is a smooth objective function. Such problems arise in a large body of machine learning (ML) applications, including logistic regression (Conroy & Sajda, 2012), ridge regression (Shen et al., 2013), least-squares problems (Suykens & Vandewalle, 1999), and the training of deep neural networks. The formulation (1) can express a distributed optimization problem across n agents, where each function f_i represents the objective function of agent i, or an optimization problem where each f_i is the objective function associated with data point i. We assume that we work in the Zero Order (ZO) optimization setting, i.e., we do not have access to the derivatives of any function f_i and only function evaluations are available. Such a situation arises in many fields and may occur for several reasons, for example: (i) In many optimization problems, the objective function is available only as the output of a black-box or simulation oracle, and hence derivative information is absent (Conn et al., 2009). (ii) In some settings the objective function is evaluated through legacy software, and modifying this software to provide first-order derivatives may be too costly or impossible (Conn et al., 2009; Nesterov & Spokoiny, 2017). (iii) In some situations, derivatives of the objective function are not directly available but can in principle be extracted. This requires access to, and a good understanding of, the simulation code; the process is considered invasive to the simulation code and also very costly in terms of coding effort (Kramer et al., 2011). (iv) When using commercial software that only evaluates the functions, it is impossible to compute the derivatives because the simulation code is inaccessible (Kramer et al., 2011; Conn et al., 2009).
(v) When only noisy function evaluations are available, computing derivatives is of little use because they are unreliable (Conn et al., 2009). ZO optimization has been used in many ML applications, for instance: hyperparameter tuning of ML models (Turner et al., 2021; P.Koch et al., 2018), multi-agent target tracking (Al-Abri et al., 2021), and policy optimization in reinforcement learning algorithms (Malik et al., 2020; Li et al., 2020). At each iteration, STP draws a random direction $s$ and updates the iterate as $x_{+} = \arg\min\{f(x - \alpha s),\, f(x + \alpha s),\, f(x)\}$, where $\alpha > 0$ is the stepsize. STP is simple, very easy to implement, and has better complexity bounds than deterministic direct search (DDS) methods. Due to its efficiency and simplicity, STP paved the way for other firsts, namely the first work on importance sampling in the random direct-search setting (the STP_IS method) (Bibi et al., 2020) and the first ZO method with heavy-ball momentum (SMTP) and with importance sampling (SMTP_IS) (Gorbunov et al., 2020). To solve problem (1), STP evaluates f twice at each iteration, which means performing two new computations over all the training data for a single parameter update. Proceeding in this manner is not always efficient: when the total number of training samples is extremely large, as in large-scale machine learning, it becomes computationally expensive to use the whole dataset at each iteration. Moreover, training with minibatches of the data can be as efficient as, or better than, using the full batch, as in the case of SGD (Gower et al., 2019). Motivated by this, we introduce MiSTP, which extends STP to use subsets of the data at each iteration of the training process.
We consider in this paper the finite-sum problem, as it is widely encountered in ML applications, but our approach applies to the more general case where the objective does not necessarily have a finite-sum structure and only an approximation of the objective function can be computed. Such a situation may occur, for instance, when the objective function is the output of a stochastic oracle that provides only noisy/stochastic evaluations.
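To fix ideas, here is a minimal sketch of one MiSTP-style iteration in Python. Everything in it is illustrative rather than the paper's exact algorithm: the least-squares choice of f_i, the helper names `minibatch_loss` and `mistp_step`, the normalized Gaussian search direction, and the decaying stepsize schedule are all assumptions made for this example.

```python
import numpy as np

def minibatch_loss(w, X, y, idx):
    """Approximate f(w) = (1/n) * sum_i f_i(w) on a sampled minibatch idx.
    Here f_i is a least-squares loss; any smooth f_i would do."""
    residual = X[idx] @ w - y[idx]
    return 0.5 * np.mean(residual ** 2)

def mistp_step(w, X, y, alpha, batch_size, rng):
    """One iteration in the spirit of MiSTP: draw a random direction s, then
    keep the best of {w, w + alpha*s, w - alpha*s} as judged by the SAME
    minibatch estimate of the objective (no derivatives are used)."""
    s = rng.standard_normal(w.shape)
    s /= np.linalg.norm(s)                      # unit-norm direction
    idx = rng.choice(X.shape[0], batch_size, replace=False)
    candidates = [w, w + alpha * s, w - alpha * s]
    values = [minibatch_loss(c, X, y, idx) for c in candidates]
    return candidates[int(np.argmin(values))]

# usage: minimize a small noiseless least-squares problem
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = X @ w_true
w = np.zeros(5)
for k in range(2000):
    w = mistp_step(w, X, y, alpha=0.5 / np.sqrt(k + 1), batch_size=32, rng=rng)
```

Note that the three candidate points are compared on the same minibatch, so each step is a best-of-three decision under a noisy but consistent estimate of f.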

1.1. CONTRIBUTIONS

In this section, we highlight the key contributions of this work.
• We propose the MiSTP method, extending the STP method (Bergou et al., 2020) to the case where only an approximation of the objective function is available at each iteration.
• We analyze our method's complexity for nonconvex and convex objective functions.
• We present experimental results on the performance of MiSTP on multiple ML tasks, namely ridge regression, regularized logistic regression, and the training of a neural network. We evaluate MiSTP with different minibatch sizes and compare it with Stochastic Gradient Descent (SGD) (Gower et al., 2019) and other ZO methods.

1.2. OUTLINE

The paper is organized as follows: In section 2 we present our MiSTP method. In section 2.1 we describe the main assumptions on the random search directions that ensure the convergence of our method; these assumptions are the same as those used for STP (Bergou et al., 2020). Then, in section 2.2 we formulate the key lemma for the iteration complexity analysis. In section 3 we analyze the worst-case complexity of our method for smooth nonconvex and convex problems. In section 4 we present and discuss our experimental results: in section 4.1 we report the results on ridge regression and regularized logistic regression problems, and in section 4.2 we report the results on neural networks. Finally, we conclude in section 5.



Spokoiny, 2017). Ghadimi & Lan (2013) proposed a stochastic version of the algorithm of Nesterov & Spokoiny (2017) (called RSGF) for the case where function values are stochastic rather than deterministic. Liu et al. (2018) also proposed a ZO stochastic variance-reduced method (called ZO-SVRG) based on the minibatch variant of the SVRG method (Reddi et al., 2016). ZO-SVRG can use different gradient estimators, namely RandGradEst, Avg-RandGradEst, and CoordGradEst, presented in Liu et al. (2018). Another popular class of ZO methods is Direct-Search (DS) methods. They determine the next iterate based solely on function values and neither approximate the derivatives nor build a surrogate model of the objective function (Conn et al., 2009). For a comprehensive view of the classes of ZO methods we refer the reader to the survey by Larson et al. (2019). More related to our work, Bergou et al. (2020) proposed a ZO method called Stochastic Three Points (STP), which is a general variant of direct-search methods. At each iteration, STP generates a random search direction $s$ according to a certain probability distribution and updates the iterate as follows: $x_{+} = \arg\min\{f(x - \alpha s),\, f(x + \alpha s),\, f(x)\}$.
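To make the contrast with the gradient-estimation line of work concrete, here is a hedged sketch of a two-point random gradient estimator in the spirit of RandGradEst. The function name `rand_grad_est`, the smoothing radius `mu`, and the averaging over 5000 draws are illustrative choices for this example, not the exact estimator or constants from Liu et al. (2018):

```python
import numpy as np

def rand_grad_est(f, x, mu, rng):
    """Two-point random gradient estimate using only function values:
    g = d * (f(x + mu*u) - f(x)) / mu * u, with u uniform on the unit sphere.
    mu > 0 is the smoothing radius; smaller mu means less bias."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)          # uniform direction on the unit sphere
    return d * (f(x + mu * u) - f(x)) / mu * u

# usage: estimate the gradient of f(x) = ||x||^2 / 2 (true gradient is x);
# a single draw is very noisy, so we average many estimates
rng = np.random.default_rng(1)
f = lambda x: 0.5 * np.dot(x, x)
x = np.array([1.0, -2.0, 3.0])
g_avg = np.mean([rand_grad_est(f, x, mu=1e-5, rng=rng)
                 for _ in range(5000)], axis=0)
```

Methods like RSGF and ZO-SVRG feed such estimates into a gradient-style update, whereas DS methods such as STP (and our MiSTP) skip gradient estimation entirely and compare function values at candidate points directly.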

