MINIBATCH STOCHASTIC THREE POINTS METHOD FOR UNCONSTRAINED SMOOTH MINIMIZATION

Abstract

In this paper, we propose a new zero-order optimization method, the minibatch stochastic three points (MiSTP) method, for solving unconstrained minimization problems in a setting where only an approximation of the objective function can be evaluated. It builds on the recently proposed stochastic three points (STP) method (Bergou et al., 2020). At each iteration, MiSTP generates a random search direction in the same manner as STP, but chooses the next iterate based solely on an approximation of the objective function rather than its exact evaluation. We also analyze our method's complexity in the nonconvex and convex cases and evaluate its performance on multiple machine learning tasks.

1. INTRODUCTION

In this paper we consider the following unconstrained finite-sum optimization problem:

min_{x∈R^d} f(x) := (1/n) ∑_{i=1}^n f_i(x),    (1)

where each f_i : R^d → R is a smooth objective function. Such problems arise in a large body of machine learning (ML) applications, including logistic regression (Conroy & Sajda, 2012), ridge regression (Shen et al., 2013), least squares problems (Suykens & Vandewalle, 1999), and the training of deep neural networks. The formulation (1) can express a distributed optimization problem across n agents, where each function f_i is the objective of agent i, or a problem where each f_i is the objective associated with data point i. We assume that we work in the zero-order (ZO) optimization setting, i.e., we do not have access to the derivatives of any function f_i and only function evaluations are available. Such a situation arises in many fields and may occur for several reasons, for example: (i) in many optimization problems the objective function is only available as the output of a black-box or simulation oracle, so derivative information is absent (Conn et al., 2009); (ii) in some situations the objective function is evaluated through legacy software, and modifying this software to provide first-order derivatives may be too costly or impossible (Conn et al., 2009; Nesterov & Spokoiny, 2017); (iii) in other situations, derivatives of the objective function are not directly available but could be extracted; this requires access to, and a good understanding of, the simulation code, is considered invasive to that code, and is very costly in terms of coding effort (Kramer et al., 2011); (iv) when using commercial software that only evaluates the functions, computing the derivatives is impossible because the simulation code is inaccessible (Kramer et al., 2011; Conn et al., 2009);
(v) when only noisy function evaluations are available, computed derivatives are unreliable and therefore of little use (Conn et al., 2009). ZO optimization has been used in many ML applications, for instance: hyperparameter tuning of ML models (Turner et al., 2021; P. Koch et al., 2018), multi-agent target tracking (Al-Abri et al., 2021), policy optimization in reinforcement learning (Malik et al., 2020; Li et al., 2020), maximization of the area under the curve (AUC) (Ghanbari & Scheinberg, 2017), automatic speech recognition (Watanabe & Roux, 2014), and the generation of black-box adversarial attacks on deep neural network classifiers (Ughi et al., 2021). The Google Vizier system (Golovin et al., 2017), the de facto parameter-tuning engine at Google, is also based on ZO optimization. Many ZO methods exist for solving problem (1); most of them approximate the gradient using gradient-smoothing techniques such as the popular two-point gradient estimator (Nesterov & Spokoiny, 2017). Ghadimi & Lan (2013) proposed a stochastic version (called RSGF) of the algorithm of Nesterov & Spokoiny (2017) for the case where function values are stochastic rather than deterministic. Liu et al. (2018) proposed a ZO stochastic variance-reduced method (called ZO-SVRG) based on the minibatch variant of the SVRG method (Reddi et al., 2016); ZO-SVRG can use different gradient estimators, namely RandGradEst, Avg-RandGradEst, and CoordGradEst, presented in Liu et al. (2018). Another popular class of ZO methods is direct-search (DS) methods: they determine the next iterate based solely on function values and neither approximate the derivatives nor build a surrogate model of the objective function (Conn et al., 2009). For a comprehensive overview of the classes of ZO methods we refer the reader to the survey by Larson et al. (2019). Most related to our work, Bergou et al. (2020) proposed a ZO method called stochastic three points (STP), a general variant of direct-search methods. At each iteration, STP generates a random search direction s according to a certain probability distribution and updates the iterate as follows:

x = arg min{f(x - αs), f(x + αs), f(x)},

where α > 0 is the stepsize. STP is simple, very easy to implement, and has better complexity bounds than deterministic direct-search (DDS) methods. Due to its efficiency and simplicity, STP paved the way for several other firsts, namely the first work on importance sampling in the random direct-search setting (the STP_IS method) (Bibi et al., 2020) and the first ZO methods with heavy-ball momentum (SMTP) and with importance sampling (SMTP_IS) (Gorbunov et al., 2020). To solve problem (1), STP evaluates f twice at each iteration, i.e., it performs two new computations over all the training data for a single parameter update. Proceeding in this manner is not always efficient: when the total number of training samples is extremely large, as in large-scale machine learning, using the whole dataset at each iteration becomes computationally expensive. Moreover, training with minibatches of the data can be as efficient as, or better than, using the full batch, as is the case for SGD (Gower et al., 2019). Motivated by this, we introduce MiSTP, which extends STP to the case of using subsets of the data at each iteration of the training process. We consider in this paper the finite-sum problem, as it is widely encountered in ML applications, but our approach applies to the more general case where the objective does not necessarily have a finite-sum structure and only an approximation of the objective function can be computed.
Such a situation may arise, for instance, when the objective function is the output of a stochastic oracle that provides only noisy/stochastic evaluations.
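The STP update described above can be sketched in a few lines. This is a minimal illustration, not the authors' reference implementation: the quadratic objective `f`, the decaying stepsize schedule, and the Gaussian-normalization sphere sampler are all assumed choices.

```python
import numpy as np

def stp_step(f, x, alpha, rng):
    """One STP iteration: compare f at x, x + alpha*s, x - alpha*s and keep the best."""
    s = rng.standard_normal(x.shape)
    s /= np.linalg.norm(s)  # direction uniform on the unit sphere
    candidates = [x, x + alpha * s, x - alpha * s]
    # The next iterate is the candidate with the smallest objective value,
    # so f(x_k) is non-increasing by construction.
    return min(candidates, key=f)

# Toy usage on f(z) = ||z||^2: the iterates should approach the minimizer 0.
rng = np.random.default_rng(0)
f = lambda z: float(np.sum(z ** 2))
x = np.ones(5)
for k in range(2000):
    x = stp_step(f, x, alpha=0.1 / (k + 1) ** 0.5, rng=rng)
```

Because the current point is always one of the three candidates, the objective value never increases along the iterations, which is one reason the method is so simple to use.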

1.1. CONTRIBUTIONS

In this section, we highlight the key contributions of this work.
• We propose the MiSTP method, which extends the STP method (Bergou et al., 2020) to the case where only an approximation of the objective function is available at each iteration.
• We analyze our method's complexity for nonconvex and convex objective functions.
• We present experimental results on the performance of MiSTP on multiple ML tasks, namely ridge regression, regularized logistic regression, and the training of a neural network. We evaluate the performance of MiSTP with different minibatch sizes and compare it with stochastic gradient descent (SGD) (Gower et al., 2019) and other ZO methods.

1.2. OUTLINE

The paper is organized as follows. In Section 2 we present our MiSTP method. In Section 2.1 we describe the main assumptions on the random search directions that ensure the convergence of our method; these assumptions are the same as the ones used for STP (Bergou et al., 2020). Then, in Section 2.2 we formulate the key lemma for the iteration complexity analysis. In Section 3 we analyze the worst-case complexity of our method for smooth nonconvex and convex problems. In Section 4 we present and discuss our experimental results: in Section 4.1 we report results on ridge regression and regularized logistic regression problems, and in Section 4.2 we report results on neural networks. Finally, we conclude in Section 5.

1.3. NOTATION

Throughout the paper, D denotes a probability distribution over R^d. We use E[•] to denote the expectation, E_ξ[•] to denote the expectation over the randomness of ξ conditioned on all other random quantities, and, for two random variables X and Y, E[X|Y] to denote the expectation of X given Y. ⟨x, y⟩ = x^⊤y is the inner product of x and y. We denote by ∥•∥_2 the ℓ_2-norm and by ∥•∥_D a norm dependent on D. Finally, we denote by f_B the minibatch approximation f_B(x) = (1/|B|) ∑_{i∈B} f_i(x), where B is a subset of indices chosen from {1, 2, ..., n} and |B| is its cardinality.

2. MISTP METHOD

Our minibatch stochastic three points (MiSTP) algorithm is formalized below as Algorithm 1.

Algorithm 1: Minibatch Stochastic Three Points (MiSTP)
Initialization: choose x_0 ∈ R^d, positive stepsizes {α_k}_{k≥0}, and a probability distribution D on R^d.
For k = 0, 1, 2, ...
1. Generate a random vector s_k ∼ D.
2. Choose the elements of the subset B_k uniformly at random.
3. Let x_+ = x_k + α_k s_k and x_- = x_k - α_k s_k.
4. Set x_{k+1} = arg min{f_{B_k}(x_-), f_{B_k}(x_+), f_{B_k}(x_k)}.

Due to the randomness of the search directions s_k and the minibatches B_k for k ≥ 0, the iterates are also random vectors for all k ≥ 1. The starting point x_0 is not random (the initial objective value f(x_0) is deterministic).

Lemma 1. For x ∈ R^d such that x is independent of B, i.e., the choice of x does not depend on the choice of B, f_B(x) is an unbiased estimator of f(x). Proof: see Appendix A, Section A.1.

Throughout the paper, we assume that each f_i (for i = 1, ..., n) is differentiable with L_i-Lipschitz gradient, and that f is bounded from below.

Assumption 1. Each objective function f_i (for i = 1, ..., n) is L_i-smooth with L_i > 0, and f is bounded from below by f^* ∈ R. That is, f_i has a Lipschitz continuous gradient with Lipschitz constant L_i: ∥∇f_i(x) - ∇f_i(y)∥_2 ≤ L_i ∥x - y∥_2 for all x, y ∈ R^d, and f(x) ≥ f^* for all x ∈ R^d.

Assumption 2. We assume that the deviation of f_B(x) from f(x) is bounded for all x ∈ R^d: E_B[(f(x) - f_B(x))^2] ≤ σ_{|B|}^2 < ∞. This assumption is very common in the stochastic optimization literature (Larson et al., 2019, Section 6). Note that the subscript |B| in σ_{|B|} indicates that this deviation may depend on the minibatch size. Consider, for example, the case of sampling minibatches uniformly with replacement. In this case, the expected deviation between f and f_B satisfies E_B[(f(x) - f_B(x))^2] ≤ A/|B| for all x ∈ R^d independent of B, where A = sup_{x∈R^d} (1/n) ∑_{i=1}^n (f_i(x) - f(x))^2 (see Appendix A, Section A.2).
Note that, since the function y ↦ y^2 is convex on R, Jensen's inequality gives (E_B[|f(x) - f_B(x)|])^2 ≤ E_B[(f(x) - f_B(x))^2]. Therefore, E_B[|f(x) - f_B(x)|] ≤ σ_{|B|}.
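Algorithm 1 can be sketched as follows, assuming a uniform-on-the-sphere direction distribution and uniform minibatch sampling without replacement. The least-squares instance at the end is a toy example for illustration, not one of the paper's benchmarks.

```python
import numpy as np

def mistp(f_batch, n, x0, alphas, batch_size, num_iters, rng):
    """Minibatch STP: compare f_B at x, x + a*s, x - a*s on a fresh minibatch B_k.

    f_batch(x, idx) should return the minibatch approximation f_B(x), i.e. the
    average of the component functions f_i, i in idx, evaluated at x.
    """
    x = x0.copy()
    for k in range(num_iters):
        s = rng.standard_normal(x.shape)
        s /= np.linalg.norm(s)                                 # s_k ~ uniform on the sphere
        batch = rng.choice(n, size=batch_size, replace=False)  # B_k chosen u.a.r.
        a = alphas(k)
        candidates = [x, x + a * s, x - a * s]
        # All three points are compared on the SAME minibatch B_k.
        x = min(candidates, key=lambda z: f_batch(z, batch))
    return x

# Toy finite-sum instance: f_i(x) = 0.5 * (a_i^T x - y_i)^2, minimizer at ones(5).
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
y = A @ np.ones(5)
fB = lambda x, idx: 0.5 * float(np.mean((A[idx] @ x - y[idx]) ** 2))
x = mistp(fB, n=200, x0=np.zeros(5), alphas=lambda k: 0.1 / (k + 1) ** 0.5,
          batch_size=20, num_iters=3000, rng=rng)
```

Note that, unlike STP, the full objective f is no longer guaranteed to be monotone along the iterates, since each comparison uses only the minibatch approximation f_{B_k}.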

2.1. ASSUMPTION ON THE DIRECTIONS

Our analysis in the remainder of the paper is based on the following key assumption.

Assumption 3. The probability distribution D on R^d has the following properties:
1. The quantity E_{s∼D}[∥s∥_2^2] is positive and finite. Without loss of generality, in the rest of this paper we assume that it is equal to 1.
2. There is a constant µ_D > 0 and a norm ∥•∥_D on R^d such that for all g ∈ R^d,

E_{s∼D}[|⟨g, s⟩|] ≥ µ_D ∥g∥_D.    (3)

As proved in the STP paper (Bergou et al., 2020), multiple distributions satisfy this assumption, for example: the uniform distribution on the unit sphere in R^d; the normal distribution with zero mean and the d × d identity as covariance matrix; the uniform distribution over the standard unit basis vectors {e_1, ..., e_d} of R^d; and the uniform distribution over a set S = {s_1, ..., s_d}, where {s_1, ..., s_d} form an orthonormal basis of R^d.
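The example distributions above can be sampled as follows. This is a sketch; note that the Gaussian sampler below is scaled by 1/√d so that E∥s∥_2^2 = 1, matching the normalization in part 1 of Assumption 3 (the unscaled identity-covariance Gaussian has E∥s∥_2^2 = d).

```python
import numpy as np

def sphere(rng, d):
    """Uniform on the unit sphere: normalize a standard Gaussian vector."""
    s = rng.standard_normal(d)
    return s / np.linalg.norm(s)

def gaussian(rng, d):
    """N(0, I/d): scaled so that E||s||_2^2 = 1."""
    return rng.standard_normal(d) / np.sqrt(d)

def coordinate(rng, d):
    """Uniform over the standard basis vectors e_1, ..., e_d."""
    s = np.zeros(d)
    s[rng.integers(d)] = 1.0
    return s

rng = np.random.default_rng(0)
d = 4
```

For the sphere and coordinate distributions, ∥s∥_2 = 1 holds deterministically; for the scaled Gaussian it holds only in expectation.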

2.2. KEY LEMMA

Now we establish the key result that will be used to prove the main properties of our algorithm.

Lemma 2. If Assumptions 1, 2, and 3 hold, then for all k ≥ 0,

θ_{k+1} ≤ θ_k - µ_D α_k g_k + (L/2) α_k^2 + σ_{|B|},

where L_{B_k} = (1/|B_k|) ∑_{i∈B_k} L_i, L = E[L_{B_k}] = (1/n) ∑_{i=1}^n L_i, θ_k = E[f(x_k)], g_k = E[∥∇f(x_k)∥_D], and |B_k| is the minibatch size.

Proof. We have f(x_{k+1}) - f_{B_k}(x_{k+1}) ≤ |f(x_{k+1}) - f_{B_k}(x_{k+1})|, i.e.,

f(x_{k+1}) ≤ f_{B_k}(x_{k+1}) + |f(x_{k+1}) - f_{B_k}(x_{k+1})|.    (5)

Since x_{k+1} = arg min{f_{B_k}(x_k - α_k s_k), f_{B_k}(x_k + α_k s_k), f_{B_k}(x_k)}, we have

f_{B_k}(x_{k+1}) ≤ f_{B_k}(x_k + α_k s_k).    (6)

From the L_i-smoothness of f_i we have f_i(x_k + α_k s_k) ≤ f_i(x_k) + ⟨∇f_i(x_k), α_k s_k⟩ + (L_i/2) ∥α_k s_k∥_2^2. Summing over i ∈ B_k and multiplying by 1/|B_k|, we get

f_{B_k}(x_k + α_k s_k) ≤ f_{B_k}(x_k) + ⟨∇f_{B_k}(x_k), α_k s_k⟩ + (L_{B_k}/2) ∥α_k s_k∥_2^2 = f_{B_k}(x_k) + α_k ⟨∇f_{B_k}(x_k), s_k⟩ + (L_{B_k}/2) α_k^2 ∥s_k∥_2^2.    (8)

Combining inequalities (5), (6), and (8), we get

f(x_{k+1}) ≤ f_{B_k}(x_k) + α_k ⟨∇f_{B_k}(x_k), s_k⟩ + (L_{B_k}/2) α_k^2 ∥s_k∥_2^2 + e_{B_k}^{k+1},

where e_{B_k}^{k+1} = |f(x_{k+1}) - f_{B_k}(x_{k+1})|. Taking the expectation conditioned on x_k and s_k and using Assumption 2, we get

E[f(x_{k+1}) | x_k, s_k] ≤ f(x_k) + α_k ⟨∇f(x_k), s_k⟩ + (L/2) α_k^2 ∥s_k∥_2^2 + σ_{|B|}.

Similarly, we can get (see details in Appendix A, Section A.3):

E[f(x_{k+1}) | x_k, s_k] ≤ f(x_k) - α_k ⟨∇f(x_k), s_k⟩ + (L/2) α_k^2 ∥s_k∥_2^2 + σ_{|B|}.

From the two inequalities above we conclude that

E[f(x_{k+1}) | x_k, s_k] ≤ f(x_k) - α_k |⟨∇f(x_k), s_k⟩| + (L/2) α_k^2 ∥s_k∥_2^2 + σ_{|B|}.

Taking the expectation over s_k and using inequality (3), we get

E[f(x_{k+1}) | x_k] ≤ f(x_k) - α_k µ_D ∥∇f(x_k)∥_D + (L/2) α_k^2 + σ_{|B|}.

Finally, taking the full expectation and using the tower property of the expectation, we get

E[f(x_{k+1})] ≤ E[f(x_k)] - α_k µ_D E[∥∇f(x_k)∥_D] + (L/2) α_k^2 + σ_{|B|}.

3. COMPLEXITY ANALYSIS

We first state, in Theorem 1, the most general complexity result for MiSTP, where we make no additional assumptions on the objective beyond the smoothness of the f_i, for i = 1, ..., n, and the boundedness of f from below. The proofs follow the same reasoning as those for STP (Bergou et al., 2020); we defer them to the appendix.

Theorem 1 (nonconvex case). Let Assumptions 1, 2, and 3 be satisfied and σ_{|B|} < (µ_D ε)^2 / (2L). Choose a fixed stepsize α_k = α with

(µ_D ε - √((µ_D ε)^2 - 2L σ_{|B|})) / L < α < (µ_D ε + √((µ_D ε)^2 - 2L σ_{|B|})) / L.

If

K ≥ k(ε) := (f(x_0) - f^*) / (µ_D ε α - (L/2) α^2 - σ_{|B|}) - 1,

then min_{k=0,1,...,K} E[∥∇f(x_k)∥_D] ≤ ε. In particular, the optimal stepsize is α_optimal = µ_D ε / L. Proof: see Appendix A, Section A.4.

We now state the complexity of MiSTP in the case of convex f. To do so, we add the following assumption.

Assumption 4. We assume that f is convex, has a minimizer x^*, and has a bounded level set at x_0:

R_0 := max{∥x - x^*∥_D^* : f(x) ≤ f(x_0)} < +∞,

where ∥ξ∥_D^* := max{⟨ξ, x⟩ | ∥x∥_D ≤ 1} defines the norm dual to ∥•∥_D.

Note that if the above assumption holds, then whenever f(x) ≤ f(x_0) we get

f(x) - f(x^*) ≤ ⟨∇f(x), x - x^*⟩ = ∥∇f(x)∥_D (x - x^*)^⊤ ∇f(x)/∥∇f(x)∥_D ≤ ∥∇f(x)∥_D ∥x - x^*∥_D^* ≤ R_0 ∥∇f(x)∥_D,

that is, ∥∇f(x)∥_D ≥ (f(x) - f(x^*)) / R_0.

Theorem 2 (convex case). Let Assumptions 1, 2, 3, and 4 be satisfied. Let ε > 0 and σ_{|B|} < (µ_D ε)^2 / (4L R_0^2), and choose the constant stepsize α_k = α = ε µ_D / (L R_0). If

K ≥ (L R_0^2 / (µ_D^2 ε)) log(4 (f(x_0) - f(x^*)) / ε),

then E[f(x_K) - f(x^*)] ≤ ε. Proof: see Appendix A, Section A.5.
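The mechanics behind Theorem 1 can be seen by summing the bound of Lemma 2 over k = 0, ..., K with a constant stepsize α. This is a sketch only; the full argument is in Appendix A.

```latex
% Telescoping \theta_{k+1} \le \theta_k - \mu_D \alpha\, g_k
%            + \tfrac{L}{2}\alpha^2 + \sigma_{|B|} over k = 0,\dots,K:
\mu_D \alpha \sum_{k=0}^{K} g_k
  \;\le\; \theta_0 - \theta_{K+1} + (K+1)\Big(\tfrac{L}{2}\alpha^2 + \sigma_{|B|}\Big)
  \;\le\; f(x_0) - f^* + (K+1)\Big(\tfrac{L}{2}\alpha^2 + \sigma_{|B|}\Big),
% using \theta_{K+1} \ge f^*. Hence the best iterate satisfies
\min_{0 \le k \le K} g_k
  \;\le\; \frac{f(x_0) - f^*}{(K+1)\,\mu_D \alpha}
        + \frac{\tfrac{L}{2}\alpha^2 + \sigma_{|B|}}{\mu_D \alpha}.
% The stepsize condition of Theorem 1 is exactly what guarantees
% \mu_D \varepsilon \alpha - \tfrac{L}{2}\alpha^2 - \sigma_{|B|} > 0,
% so that taking K \ge k(\varepsilon) drives the bound below \varepsilon.
```

Note how the minibatch error σ_{|B|} enters as an additive floor: unlike in STP, the stepsize cannot be taken arbitrarily small, which is why the theorem requires σ_{|B|} < (µ_D ε)^2 / (2L).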

4. NUMERICAL RESULTS

In this section, we report the results of experiments conducted to evaluate the efficiency of MiSTP. All the presented results are averaged over 10 runs of the algorithm, and the confidence intervals (the shaded regions in the graphs) are given by µ ± σ/2, where µ is the mean and σ is the standard deviation. For each minibatch size, we choose the learning rate α by performing

4.1. MISTP ON RIDGE REGRESSION AND REGULARIZED LOGISTIC REGRESSION PROBLEMS

We performed experiments on ridge regression and regularized logistic regression, two problems with a strongly convex objective function f. For ridge regression we solve

min_{x∈R^d} f(x) = (1/2n) ∑_{i=1}^n (A[i,:] x - y_i)^2 + (λ/2) ∥x∥_2^2,    (11)

and for regularized logistic regression we solve

min_{x∈R^d} f(x) = (1/2n) ∑_{i=1}^n ln(1 + exp(-y_i A[i,:] x)) + (λ/2) ∥x∥_2^2.    (12)

In both problems, A ∈ R^{n×d} and y ∈ R^n are the given data and λ > 0 is the regularization parameter. For logistic regression, y ∈ {-1, 1}^n and all the values in the first column of A are equal to 1. For both problems we set λ = 1/n. The experiments in this section are conducted on LIBSVM datasets (Chang & Lin, 2011). In Section 4.1.1 we evaluate the performance of MiSTP with different minibatch sizes, in Section 4.1.2 we compare MiSTP with SGD, and in Section 4.1.3 we compare MiSTP with some other ZO methods.
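For concreteness, the full and minibatch versions of the ridge objective (11) can be written as below. This is an illustrative sketch with assumed variable names, not the authors' experimental code.

```python
import numpy as np

def f_ridge(x, A, y, lam):
    """Full ridge objective (11): (1/2n) * sum_i (A[i,:] x - y_i)^2 + (lam/2)*||x||^2."""
    n = A.shape[0]
    return float(np.sum((A @ x - y) ** 2) / (2 * n) + 0.5 * lam * np.dot(x, x))

def f_ridge_batch(x, A, y, lam, idx):
    """Minibatch approximation f_B, using only the rows of (A, y) indexed by idx."""
    return float(np.sum((A[idx] @ x - y[idx]) ** 2) / (2 * len(idx))
                 + 0.5 * lam * np.dot(x, x))
```

With idx equal to the full index set, f_ridge_batch coincides with f_ridge, and averaging f_B over uniformly chosen minibatches recovers f in expectation (Lemma 1).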

4.1.1. MISTP WITH DIFFERENT MINIBATCH SIZES

Figures 1 and 2 show the performance of MiSTP with different minibatch sizes. From these figures we see good performance of MiSTP: for various minibatch sizes, it generally converges faster than the original STP (the full batch) in terms of the number of epochs. We also notice that there is an optimal minibatch size for each dataset: among the tested values, it is 50 for the 'abalone' dataset, 1 for the 'splice' dataset, and 10 for the 'a1a' and 'australian' datasets. All these optimal minibatch sizes correspond to a very small subset of the whole dataset, which means less computation at each iteration. These results also show that we can obtain good performance using only an approximation of the objective function computed on a small subset of the data, rather than exact function evaluations.

4.1.2. COMPARISON WITH SGD

In this section, we report the results of experiments conducted to compare the performance of MiSTP with SGD. For both methods, we used the same starting point at each run and the same minibatch at each iteration. Figures 3 and 4 show the results on ridge regression and regularized logistic regression problems, respectively; more results are presented in Appendix B. From these experiments we see that, in most cases, MiSTP converges to a good approximation of, or exactly, the same solution as SGD. MiSTP is competitive with SGD when the dimension of the problem is small, i.e., when d is around 10 or less. When the dimension is larger, i.e., when d is of the order of tens, MiSTP needs more iterations than SGD to converge to just an approximation of the solution. In all cases, the number of iterations MiSTP needs to converge increases as the batch size decreases; it also increases with the dimension of the problem, whereas SGD is only slightly affected by the dimension. In Appendix B, we report the values of the approximation f_B alongside f for multiple minibatch sizes.
We can see that, starting from a given batch size (generally τ ≥ 500 for the given datasets), f_B is a good approximation of f, which shows that we can obtain the same results when training a model with only a subset of the data as when using all available samples, and consequently with less computation.

4.1.3. COMPARISON WITH OTHER ZO METHODS

In this section, we compare the performance of MiSTP with three other ZO optimization methods. The first is RSGF, proposed by Ghadimi & Lan (2013). In this method, at iteration k, the iterate is updated as follows:

x_{k+1} = x_k - α_k [(f_{B_k}(x_k + µ_k s_k) - f_{B_k}(x_k)) / µ_k] s_k,

where µ_k ∈ (0, 1) is the finite-difference parameter, α_k is the stepsize, s_k is a random vector following the uniform distribution on the unit sphere, and B_k is a randomly chosen minibatch. The second is ZO-SVRG, proposed by Liu et al. (2018, Algorithm 2). In this method, at iteration k, the gradient estimate of f_{B_k} at x_k is given by

∇̂f_{B_k}(x_k) = (d/µ) (f_{B_k}(x_k + µ s_k) - f_{B_k}(x_k)) s_k,    (14)

where µ > 0 is the smoothing parameter and s_k is a random direction drawn from the uniform distribution on the unit sphere. The last is ZO-CD (ZO coordinate descent); in this method, at iteration k, the iterate is updated as follows:

x_{k+1} = x_k - α_k ĝ_{B_k},   ĝ_{B_k} = ∑_{i=1}^d [(f_{B_k}(x_k + µ e_i) - f_{B_k}(x_k - µ e_i)) / (2µ)] e_i,

where µ > 0 is a smoothing parameter and e_i ∈ R^d, for i ∈ [d], is the standard basis vector with 1 at its i-th coordinate and 0 elsewhere. The distribution D used for MiSTP here is the uniform distribution on the unit sphere. For RSGF, ZO-SVRG, and ZO-CD, we chose µ_k = µ = 10^{-4}.
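The two estimators at the heart of the baselines above can be sketched as follows. Here `fB` stands for the minibatch function f_{B_k}; these are illustrative implementations under that assumption, not the authors' code.

```python
import numpy as np

def rsgf_grad(fB, x, mu, rng):
    """RSGF-style two-point estimate: [(fB(x + mu*s) - fB(x)) / mu] * s,
    with s drawn uniformly from the unit sphere. Costs 2 function queries."""
    s = rng.standard_normal(x.shape)
    s /= np.linalg.norm(s)
    return (fB(x + mu * s) - fB(x)) / mu * s

def zo_cd_grad(fB, x, mu):
    """Central-difference coordinate estimate used by ZO-CD.
    Costs 2*d function queries per estimate."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = 1.0
        g[i] = (fB(x + mu * e) - fB(x - mu * e)) / (2 * mu)
    return g
```

The query-count difference (2 versus 2d per estimate) is what makes the function-query comparison in Figure 5 the natural metric for these methods.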

5. CONCLUSION

In this paper, we proposed the MiSTP method, which extends the STP method to the case where only an approximation of the objective function is available at each iteration, assuming the error between the objective function and its approximation is bounded. MiSTP samples the search directions in the same way as STP, but instead of comparing the exact objective function at three points it compares an approximation of it. We derived our method's complexity for nonconvex and convex objective functions. The presented numerical results showed encouraging performance of MiSTP; in some settings, it outperformed the original STP. There are a number of interesting directions for future work, namely deriving a rule to find the optimal minibatch size, comparing the performance of MiSTP with other zero-order methods on deep neural network problems, extending MiSTP to the distributed learning setting, and investigating MiSTP in the nonsmooth case.



It is known that the value of the first feature must be 1 for all the training inputs when performing logistic regression. When using LIBSVM datasets we add this column to the data.



Figure 1: Performance of MiSTP with different minibatch sizes on ridge regression problem. On the left, the abalone dataset. On the right, the splice dataset.

Figure 2: Performance of MiSTP with different minibatch sizes on regularized logistic regression problem. On the left, the a1a dataset. On the right, the australian dataset.

Figure 3: Performance of MiSTP and SGD on the ridge regression problem using real data from LIBSVM. Above, the abalone dataset: n = 4177 and d = 8. Below, the a1a dataset: n = 1605 and d = 123.

Figure 4: Performance of MiSTP and SGD on the regularized logistic regression problem using real data from LIBSVM. Above, the australian dataset: n = 690 and d = 15. Below, the a1a dataset: n = 1605 and d = 124.

Figure 5 shows the objective function values against the number of function queries for the different ZO methods using different minibatch sizes. Note that one function query is the evaluation of one f_i, i ∈ [n], at a given point. From Figure 5 we see that, on the ridge regression problem, MiSTP, RSGF, and ZO-CD show competitive performance, while ZO-SVRG needs many more function queries to converge. On the regularized logistic regression problem, MiSTP outperforms all the other methods: RSGF, ZO-CD, and ZO-SVRG need almost 5 times more function queries than MiSTP to converge for τ = 100, and around 2 times more for τ = 50.

Figure 5: Comparison of MiSTP, RSGF, ZO-SVRG, and ZO-CD. Above: ridge regression problem using the splice dataset. Below: regularized logistic regression problem using the a1a dataset.

Figure 6: Comparison of different minibatch sizes for MiSTP in a multi-layer neural network.

