ENHANCED FIRST AND ZEROTH ORDER VARIANCE REDUCED ALGORITHMS FOR MIN-MAX OPTIMIZATION

Anonymous

Abstract

Min-max optimization captures many important machine learning problems, such as robust adversarial learning and inverse reinforcement learning, and nonconvex-strongly-concave min-max optimization has been an active line of research. Specifically, a novel variance reduction algorithm, SREDA, was proposed recently by Luo et al. (2020) to solve such a problem, and was shown to achieve the optimal complexity dependence on the required accuracy level ε. Despite this superior theoretical performance, the convergence guarantee of SREDA requires a stringent initialization accuracy and an ε-dependent stepsize for controlling the per-iteration progress, so that SREDA can run very slowly in practice. This paper develops a novel analytical framework that guarantees SREDA's optimal complexity performance for a much enhanced algorithm, SREDA-Boost, which has a less restrictive initialization requirement and an accuracy-independent (and much larger) stepsize. Hence, SREDA-Boost runs substantially faster in experiments than SREDA. We further apply SREDA-Boost to propose a zeroth-order variance reduction algorithm, named ZO-SREDA-Boost, for the scenario with access only to function values rather than gradients, and show that ZO-SREDA-Boost outperforms the best known complexity dependence on ε. This is the first study that applies the variance reduction technique to zeroth-order algorithms for min-max optimization problems.

1. INTRODUCTION

Min-max optimization has attracted significant attention in machine learning, as it captures several important machine learning models and problems, including generative adversarial networks (GANs) Goodfellow et al. (2014), robust adversarial machine learning Madry et al. (2018), imitation learning Ho & Ermon (2016), etc. Min-max optimization typically takes the following form:

min_{x∈R^{d_1}} max_{y∈R^{d_2}} f(x, y), where f(x, y) = E[F(x, y; ξ)] (online case) or f(x, y) = (1/n) Σ_{i=1}^{n} F(x, y; ξ_i) (finite-sum case),  (1)

where f(x, y) takes the expectation form if data samples ξ arrive in an online fashion, and f(x, y) takes the finite-sum form if a dataset of training samples ξ_i for i = 1, ..., n is given in advance. This paper focuses on the nonconvex-strongly-concave min-max problem, in which f(x, y) is nonconvex with respect to x for all y ∈ R^{d_2}, and f(x, y) is µ-strongly concave with respect to y for all x ∈ R^{d_1}. The problem then takes the following equivalent form:

min_{x∈R^{d_1}} Φ(x) := max_{y∈R^{d_2}} f(x, y).  (2)

The objective function Φ(·) in eq. (2) is nonconvex in general, and hence algorithms for solving eq. (2) are expected to attain an approximate (i.e., ε-accurate) first-order stationary point. The convergence of deterministic algorithms for solving eq. (2) has been established in Jin et al. (2019); Nouiehed et al. (2019); Thekumparampil et al. (2019); Lu et al. (2020). SGD-type stochastic algorithms have also been proposed to solve such a problem more efficiently, including SGDmax Jin et al. (2019), PGSMD Rafique et al. (2018), and SGDA Lin et al. (2019), which respectively achieve the overall complexity of O(κ^3 ε^{-4} log(1/ε)), O(κ^3 ε^{-4}), and O(κ^3 ε^{-4}). Furthermore, several variance reduction methods have been proposed for solving eq. (2) in the nonconvex-strongly-concave case. PGSVRG Rafique et al. (2018) adopts a proximally guided SVRG method and achieves the overall complexity of O(κ^3 ε^{-4}) for the online case and O(κ n ε^{-2}) for the finite-sum case. Wai et al. (2019) converted the value function evaluation problem to a specific min-max problem and applied SAGA to achieve the overall complexity of O(κ n ε^{-2}) for the finite-sum case. More recently, Luo et al. (2020) proposed a novel nested-loop algorithm named Stochastic Recursive Gradient Descent Ascent (SREDA), which adopts the SARAH/SPIDER-type Nguyen et al. (2017a); Fang et al. (2018) recursive variance reduction method (originally designed for solving minimization problems) to construct gradient estimators for updating both x and y. Specifically, x takes a normalized gradient update in the outer loop, and each update of x is followed by one entire inner loop of updates of y. Luo et al. (2020) showed that SREDA achieves an overall complexity of O(κ^3 ε^{-3}) for the online case in eq. (1), which attains the optimal dependence on ε Arjevani et al. (2019). For the finite-sum case, SREDA achieves the complexity of O(κ^2 √n ε^{-2} + n + (n + κ) log(κ/ε)) for n ≥ κ^2, and O((κ^2 + κn) ε^{-2}) for n ≤ κ^2.

Although SREDA achieves the optimal complexity performance in theory, two issues may substantially degrade its practical performance. (1) SREDA has a stringent requirement on the initialization accuracy ζ = κ^{-2} ε^2, which hence requires O(κ^2 ε^{-2} log(κ/ε)) gradient estimations in the initialization, and is rather costly in the high-accuracy regime (i.e., for small ε). (2) The convergence of SREDA requires the stepsize to be substantially small, i.e., at the ε-level with α_t = O(min{ε/(κℓ‖v_t‖_2), 1/(κℓ)}), which restricts each iteration to make only ε-level progress with ‖x_{t+1} - x_t‖_2 = O(ε/(κℓ)). Consequently, SREDA can run very slowly in practice.
• Thus, a vital question arising here is whether we can guarantee the same optimal complexity performance of SREDA even if we considerably relax its initialization (i.e., make it much coarser than O(ε^2)) and enlarge its stepsize (i.e., make it much larger than O(ε)). The answer is highly nontrivial, because the original analysis framework for SREDA in Luo et al. (2020) critically relies on these restrictions. The first focus of this paper is on developing a novel analytical framework to guarantee that such an enhanced SREDA retains SREDA's optimal complexity performance.

Furthermore, in many machine learning scenarios, min-max optimization problems need to be solved without access to gradient information, using only function values, e.g., in multi-agent reinforcement learning with bandit feedback Wei et al. (2017); Zhang et al. (2019) and robotics Wang & Jegelka (2017); Bogunovic et al. (2018). This motivates the design of zeroth-order (i.e., gradient-free) algorithms. For nonconvex-strongly-concave min-max optimization, Liu et al. (2019) studied a constrained problem and proposed the ZO-min-max algorithm, which achieves the computational complexity of O((d_1 + d_2) ε^{-6}). Wang et al. (2020) designed ZO-SGDA and ZO-SGDMSA, where ZO-SGDMSA achieves the best known query complexity of O((d_1 + d_2) κ^2 ε^{-4} log(1/ε)) among zeroth-order algorithms for this problem. All of the above studies are of SGD type, and no effort has been made to develop variance-reduced zeroth-order algorithms for nonconvex-strongly-concave min-max optimization to further improve the query complexity.

• The second focus of this paper is on applying the aforementioned enhanced SREDA algorithm to design a zeroth-order variance-reduced algorithm for nonconvex-strongly-concave min-max problems, and further characterizing its complexity guarantee, which we anticipate to be orderwise better than that of the existing stochastic algorithms.

1.1. MAIN CONTRIBUTIONS

This paper first studies an enhanced SREDA algorithm, called SREDA-Boost, which improves SREDA in two aspects. (1) For the initialization, SREDA-Boost requires only an accuracy of ζ = κ^{-1} ε, which is much less stringent than the ζ = κ^{-2} ε^2 required by SREDA. (2) SREDA-Boost allows an accuracy-independent stepsize α = O(1/(κℓ)), which is much larger than the ε-level stepsize α_t = O(min{ε/(κℓ‖v_t‖_2), 1/(κℓ)}) adopted by SREDA. Hence, SREDA-Boost can run much faster than SREDA.

| Type ‡ | Algorithm | Stepsize for x_t ♦ | Initialization complexity | Overall complexity ♥ |
|---|---|---|---|---|
| FO | SGDmax | Θ(κ^{-1} ℓ^{-1}) | N/A | O(κ^3 ε^{-4} log(1/ε)) |
| FO | SGDA | Θ(κ^{-2} ℓ^{-1}) | N/A | O(κ^3 ε^{-4}) |
| FO | PGSMD | Θ(κ^{-2}) | N/A | O(κ^3 ε^{-4}) |
| FO | PGSVRG | Θ(κ^{-2}) | N/A | O(κ^3 ε^{-4}) |
| FO | SREDA | Θ(min{ε/(κℓ‖v_t‖_2), 1/(κℓ)}) | O(κ^2 ε^{-2} log(κ/ε)) | O(κ^3 ε^{-3}) |
| FO | SREDA-Boost | Θ(κ^{-1} ℓ^{-1}) | O(κ log κ) | O(κ^3 ε^{-3}) † |
| ZO | ZO-min-max | Θ(κ^{-1} ℓ^{-1}) | N/A | O(d ε^{-6}) |
| ZO | ZO-SGDA | Θ(κ^{-4} ℓ^{-1}) | N/A | O(d κ^5 ε^{-4}) |
| ZO | ZO-SGDMSA | Θ(κ^{-1} ℓ^{-1}) | N/A | O(d κ^2 ε^{-4} log(1/ε)) |
| ZO | ZO-SREDA-Boost | Θ(κ^{-1} ℓ^{-1}) | O(κ log κ) | O(d κ^3 ε^{-3}) |

† We clarify that SREDA-Boost should not be expected to improve the complexity order of SREDA, because SREDA already achieves the optimal complexity. Rather, SREDA-Boost improves upon SREDA through the much more relaxed requirements on initialization and stepsize under which it achieves such optimal performance.
‡ "FO" stands for "First-Order", and "ZO" stands for "Zeroth-Order".
♦ We include only the stepsize for updating x_t for comparison.
♥ The complexity for first-order algorithms refers to the total number of gradient computations to attain an ε-stationary point, and for zeroth-order algorithms to the total number of function value queries. We include only the complexity for the online case in the table, because many studies did not cover the finite-sum case. We comment on the finite-sum case in Section 4 and Section 5.2. We define d = d_1 + d_2.
The first contribution of this paper lies in developing a new analysis technique to provide the computational complexity guarantee for SREDA-Boost, establishing that, even with considerably relaxed conditions on the initialization and stepsize, SREDA-Boost achieves the same optimal complexity performance as SREDA. The analysis technique for SREDA in Luo et al. (2020) does not handle such a case, because the proof relies heavily on the stringent initialization and stepsize requirements. Central to our new analysis framework is a novel approach for bounding two interconnected stochastic error processes, the tracking error and the gradient estimation error (see Section 4 for their formal definitions), which takes three steps: bounding the two error processes accumulatively over the entire algorithm execution, decoupling these two inter-related stochastic error processes, and establishing each of their relationships with the accumulative gradient estimators.

The second contribution of this paper lies in proposing the zeroth-order variance-reduced algorithm ZO-SREDA-Boost for nonconvex-strongly-concave min-max optimization when gradient information is not accessible. For the online case, we show that ZO-SREDA-Boost achieves an overall query complexity of O((d_1 + d_2) κ^3 ε^{-3}), which outperforms the best known complexity (achieved by ZO-SGDMSA Wang et al. (2020)) in the regime ε ≤ κ^{-1}. For the finite-sum case, we show that ZO-SREDA-Boost achieves an overall query complexity of O((d_1 + d_2)(κ^2 √n ε^{-2} + n) + d_2 (κ^2 + κn) log(κ)) when n ≥ κ^2, and O((d_1 + d_2)(κ^2 + κn) ε^{-2}) when n ≤ κ^2. This is the first study that applies the variance reduction method to zeroth-order nonconvex-strongly-concave min-max optimization.

1.2. RELATED WORK

Due to the vast number of studies on min-max optimization and variance-reduced algorithms, we include below only the studies that are highly relevant to this work.

Variance reduction methods for min-max optimization are highly inspired by those for conventional minimization problems, including SAGA Defazio et al. (2014); Reddi et al. (2016), SVRG Johnson & Zhang (2013); Allen-Zhu & Hazan (2016); Allen-Zhu (2017), SARAH Nguyen et al. (2017a;b; 2018), SPIDER Fang et al. (2018), SpiderBoost Wang et al. (2019), etc. But the convergence analysis for min-max optimization is much more challenging, and is typically quite different from its counterpart for minimization problems. For strongly-convex-strongly-concave min-max optimization, Palaniappan & Bach (2016) applied SVRG and SAGA to the finite-sum case and established a linear convergence rate, and Chavdarova et al. (2019) later proposed SVRE to obtain a better bound. When the condition number of the problem is very large, Luo et al. (2019) proposed a proximal point iteration algorithm to improve the performance of SAGA. For some special cases, Du et al. (2017); Du & Hu (2019) showed that the linear convergence rate of SVRG can be maintained without the strong-convexity or strong-concavity assumption. Yang et al. (2020) applied SVRG to study min-max optimization under the two-sided Polyak-Lojasiewicz condition.

Nonconvex-strongly-concave min-max optimization is the focus of this paper. As we discuss at the beginning of the introduction, SGD-type algorithms have been developed and studied, including SGDmax Jin et al. (2019), PGSMD Rafique et al. (2018), and SGDA Lin et al. (2019). Several variance reduction methods have also been proposed to further improve the performance, including PGSVRG Rafique et al. (2018), the SAGA-type algorithm for min-max optimization Wai et al. (2019), and SREDA Luo et al. (2020). In particular, SREDA has been shown in Luo et al. (2020) to achieve the optimal complexity dependence on ε. This paper further provides the convergence guarantee for SREDA-Boost (which enhances SREDA with a relaxed initialization and a much larger stepsize) by developing a new analysis technique.

While SGD-type zeroth-order algorithms have been studied for min-max optimization, such as Menickelly & Wild (2020); Roy et al. (2019) for convex-concave min-max problems and Liu et al. (2019); Wang et al. (2020) for nonconvex-strongly-concave min-max problems, variance-reduced algorithms have not been developed for zeroth-order min-max optimization so far. This paper proposes the first such algorithm, named ZO-SREDA-Boost, for nonconvex-strongly-concave min-max optimization, and establishes its complexity performance, which outperforms that of the existing comparable algorithms (see Table 1).

2. NOTATION AND PRELIMINARIES

In this paper, we use ‖·‖_2 to denote the Euclidean norm of vectors. For a finite set S, we denote its cardinality by |S|. For a positive integer n, we denote [n] = {1, ..., n}. We assume that the min-max problem eq. (2) satisfies the following assumptions, which have also been adopted by Luo et al. (2020) for SREDA. We slightly abuse the notation ξ below to represent the random index in both the online and finite-sum cases, where in the finite-sum case, E_ξ[·] is taken with respect to the uniform distribution over {ξ_1, ..., ξ_n}.

Assumption 1. The function Φ(·) is lower bounded, i.e., we have Φ* = inf_{x∈R^{d_1}} Φ(x) > -∞.

Assumption 2. The component function F has an averaged ℓ-Lipschitz gradient, i.e., for all (x, y), (x', y') ∈ R^{d_1} × R^{d_2}, we have E_ξ[‖∇F(x, y; ξ) - ∇F(x', y'; ξ)‖_2^2] ≤ ℓ^2 (‖x - x'‖_2^2 + ‖y - y'‖_2^2).

Assumption 3. The function f is µ-strongly concave in y for any x ∈ R^{d_1}, and the component function F is concave in y, i.e., for any x ∈ R^{d_1}, y, y' ∈ R^{d_2} and ξ, we have f(x, y) ≤ f(x, y') + ⟨∇_y f(x, y'), y - y'⟩ - (µ/2)‖y - y'‖_2^2, and F(x, y; ξ) ≤ F(x, y'; ξ) + ⟨∇_y F(x, y'; ξ), y - y'⟩.

Assumption 4. The gradient of each component function F(x, y; ξ) has bounded variance, i.e., there exists a constant σ > 0 such that for any (x, y) ∈ R^{d_1} × R^{d_2}, we have E_ξ[‖∇F(x, y; ξ) - ∇f(x, y)‖_2^2] ≤ σ^2 < ∞.

Since Φ is nonconvex in general, it is NP-hard to find its global minimum. The goal here is to develop stochastic gradient algorithms that output an ε-stationary point, as defined below.

Definition 1. The point x̄ is called an ε-stationary point of the differentiable function Φ if ‖∇Φ(x̄)‖_2 ≤ ε, where ε is a positive constant.
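To make the setting concrete, the following small numpy sketch (a toy instance of our own construction, not one from the paper) builds an f(x, y) that is nonconvex in x and µ-strongly concave in y, computes Φ(x) = max_y f(x, y) in closed form, and verifies the Danskin-type identity ∇Φ(x) = ∇_x f(x, y*(x)) that underlies the notion of ε-stationarity above:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, mu = 3, 2, 0.5
A = 0.5 * rng.standard_normal((d1, d2))

def g(x):                      # nonconvex term in x
    return np.sum(np.log(1.0 + x ** 2))

def grad_g(x):
    return 2.0 * x / (1.0 + x ** 2)

def f(x, y):                   # nonconvex in x, mu-strongly concave in y
    return g(x) + x @ (A @ y) - 0.5 * mu * (y @ y)

def y_star(x):                 # unique inner maximizer: argmax_y f(x, y)
    return (A.T @ x) / mu

def Phi(x):                    # Phi(x) = max_y f(x, y) = f(x, y*(x))
    return f(x, y_star(x))

def grad_Phi(x):               # Danskin: grad Phi(x) = grad_x f(x, y*(x))
    return grad_g(x) + A @ y_star(x)

# check grad_Phi against a central finite difference of Phi
x = rng.standard_normal(d1)
h = 1e-5
num = np.zeros(d1)
for j in range(d1):
    e = np.zeros(d1); e[j] = h
    num[j] = (Phi(x + e) - Phi(x - e)) / (2 * h)
print(np.max(np.abs(num - grad_Phi(x))) < 1e-6)   # True
```

An ε-stationary point in the sense of Definition 1 is then any x̄ with `np.linalg.norm(grad_Phi(x_bar)) <= eps`.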

3. SREDA AND SREDA-BOOST ALGORITHMS

We first introduce the SREDA algorithm proposed in Luo et al. (2020), and then describe the enhanced algorithm SREDA-Boost that we study in this paper. SREDA (see Option I in Algorithm 1) utilizes the variance reduction techniques proposed in SARAH Nguyen et al. (2017a) and SPIDER Fang et al. (2018) for minimization problems to construct the gradient estimators recursively for min-max optimization. Specifically, the parameters x_t and y_t are updated in a nested-loop fashion: each update of x_t in the outer loop is followed by (m + 1) updates of y_t over one entire inner loop. Furthermore, the outer-loop updates of x_t are divided into epochs for variance reduction. Consider a certain outer-loop epoch t ∈ {(n_t - 1)q, ..., n_t q - 1} (where 1 ≤ n_t < T/q is a positive integer). At the beginning of such an epoch, the gradients are evaluated with a large batch size S_1 (see line 6 in Algorithm 1). Then, for each subsequent outer-loop iteration, an inner loop of ConcaveMaximizer (see Algorithm 2) recursively updates the gradient estimators of ∇_x f(x, y) and ∇_y f(x, y) with a small batch size S_2. Note that although the inner loop does not update x, the estimator of ∇_x f(x, y) is still updated in the inner loop. With such a variance reduction technique, SREDA outperforms all previous algorithms for nonconvex-strongly-concave min-max problems (see Table 1), and was shown to achieve the optimal dependence on ε in complexity Luo et al. (2020).

Algorithm 1 SREDA and SREDA-Boost
1: Input: x_0, initial accuracy ζ, learning rate α_t, β = O(1/ℓ), batch sizes S_1, S_2, and periods q, m.
2: Option I (SREDA): ζ = κ^{-2} ε^2; Option II (SREDA-Boost): ζ = κ^{-1} ε
3: Initialization: y_0 = iSARAH(-f(x_0, ·), ζ) (see Appendix B.2 for iSARAH(·))
4: for t = 0, 1, ..., T - 1 do
5:   if mod(t, q) = 0 then draw S_1 samples {ξ_1, ..., ξ_{S_1}}
6:     v_t = (1/S_1) Σ_{i=1}^{S_1} ∇_x F(x_t, y_t; ξ_i),  u_t = (1/S_1) Σ_{i=1}^{S_1} ∇_y F(x_t, y_t; ξ_i)
7:   else
8:     v_t = ṽ_{t-1, m_{t-1}},  u_t = ũ_{t-1, m_{t-1}}
9:   end if
10:  Option I (SREDA): α_t = min{ε/‖v_t‖_2, 1/2}·O(1/(κℓ)); Option II (SREDA-Boost): α_t = α = O(1/(κℓ))
11:  x_{t+1} = x_t - α_t v_t
12:  y_{t+1} = ConcaveMaximizer(t, m, S_2)
13: end for
14: Output: x̂ chosen uniformly at random from {x_t}_{t=0}^{T-1}

Algorithm 2 ConcaveMaximizer(t, m, S_2)
1: Initialization: x̃_{t,-1} = x_t, ỹ_{t,-1} = y_t, x̃_{t,0} = x_{t+1}, ỹ_{t,0} = y_t, ṽ_{t,-1} = v_t, ũ_{t,-1} = u_t
2: Draw S_2 samples {ξ_1, ..., ξ_{S_2}}
3: ṽ_{t,0} = ṽ_{t,-1} + (1/S_2) Σ_{i=1}^{S_2} ∇_x F(x̃_{t,0}, ỹ_{t,0}; ξ_i) - (1/S_2) Σ_{i=1}^{S_2} ∇_x F(x̃_{t,-1}, ỹ_{t,-1}; ξ_i)
4: ũ_{t,0} = ũ_{t,-1} + (1/S_2) Σ_{i=1}^{S_2} ∇_y F(x̃_{t,0}, ỹ_{t,0}; ξ_i) - (1/S_2) Σ_{i=1}^{S_2} ∇_y F(x̃_{t,-1}, ỹ_{t,-1}; ξ_i)
5: x̃_{t,1} = x̃_{t,0},  ỹ_{t,1} = ỹ_{t,0} + β ũ_{t,0}
6: for k = 1, 2, ..., m - 1 do
7:   draw S_2 samples {ξ_1, ..., ξ_{S_2}}
8:   ṽ_{t,k} = ṽ_{t,k-1} + (1/S_2) Σ_{i=1}^{S_2} ∇_x F(x̃_{t,k}, ỹ_{t,k}; ξ_i) - (1/S_2) Σ_{i=1}^{S_2} ∇_x F(x̃_{t,k-1}, ỹ_{t,k-1}; ξ_i)
9:   ũ_{t,k} = ũ_{t,k-1} + (1/S_2) Σ_{i=1}^{S_2} ∇_y F(x̃_{t,k}, ỹ_{t,k}; ξ_i) - (1/S_2) Σ_{i=1}^{S_2} ∇_y F(x̃_{t,k-1}, ỹ_{t,k-1}; ξ_i)
10:  x̃_{t,k+1} = x̃_{t,k},  ỹ_{t,k+1} = ỹ_{t,k} + β ũ_{t,k}
11: end for
12: Output: y_{t+1} = ỹ_{t,m_t} with m_t chosen uniformly at random from {0, 1, ..., m}

Although SREDA achieves the optimal complexity performance in theory, two issues can substantially slow down its practical performance. (a) Its initialization y_0 needs to satisfy a stringent ε^2-level accuracy requirement E[‖∇_y f(x_0, y_0)‖_2^2] ≤ κ^{-2} ε^2 (see line 2 in Algorithm 1), which requires as many as O(κ^2 ε^{-2} log(κ/ε)) stochastic gradient computations Luo et al. (2020). This is quite costly. (b) SREDA uses an ε-dependent stepsize and applies normalized gradient descent, so that each outer-loop update makes only ε-level progress, given by ‖x_{t+1} - x_t‖_2 = O(ε/(κℓ)).
This substantially slows down SREDA. Following the original analysis of SREDA, such choices of initialization and stepsize appear necessary to obtain the guaranteed convergence rate. In this paper, we study SREDA-Boost (see Option II in Algorithm 1), which enhances SREDA on the above two issues. (a) SREDA-Boost relaxes the initialization requirement to E[‖∇_y f(x_0, y_0)‖_2^2] ≤ κ^{-1} ε, which requires only O(κ log κ) gradient computations. This improves the computational cost of the initialization over SREDA by a factor of Õ(κ ε^{-2}). (b) SREDA-Boost adopts an ε-independent stepsize α_t = α = O(1/(κℓ)) for x_t, so that each outer-loop update can make much larger progress than SREDA. As our experiments in Section 6 demonstrate, SREDA-Boost runs much faster than SREDA. To provide the convergence guarantee for SREDA-Boost, the analysis of SREDA in Luo et al. (2020) does not apply, because its proof depends heavily on the stringent requirements on the initialization and the stepsize. Thus, this paper provides a new analysis technique for establishing the complexity guarantee of SREDA-Boost, and further applies it to gradient-free min-max problems.
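As an illustration, the following self-contained numpy sketch mirrors the structure of Algorithm 1 (Option II) and Algorithm 2 on a toy finite-sum problem of our own construction; the objective, constants, and stepsizes are illustrative assumptions rather than the paper's settings, and the returned estimators are simplified to the last inner-loop ones:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d1, d2, mu = 20, 4, 3, 1.0
A = 0.3 * rng.standard_normal((d1, d2))
a = rng.standard_normal((n, d1))              # per-sample shifts

# F(x, y; xi_i) = sum(log(1 + (x - a_i)^2)) + x^T A y - (mu/2)||y||^2:
# nonconvex in x, mu-strongly concave in y.
def grad_x(x, y, idx):
    z = x - a[idx]                            # (batch, d1)
    return np.mean(2 * z / (1 + z ** 2), axis=0) + A @ y

def grad_y(x, y, idx):
    return A.T @ x - mu * y                   # identical across samples here

def concave_maximizer(x_prev, x_new, y, v, u, m, S2, beta):
    # Algorithm 2: SPIDER-style recursive estimators; x is frozen after the
    # first correction and only y moves; we return the final v, u for brevity.
    xs, ys = [x_prev, x_new], [y, y]
    for _ in range(m):
        idx = rng.integers(n, size=S2)
        v = v + grad_x(xs[-1], ys[-1], idx) - grad_x(xs[-2], ys[-2], idx)
        u = u + grad_y(xs[-1], ys[-1], idx) - grad_y(xs[-2], ys[-2], idx)
        xs.append(xs[-1])
        ys.append(ys[-1] + beta * u)
    m_t = rng.integers(m + 1)                 # random output index in {0,...,m}
    return ys[m_t + 1], v, u

def sreda_boost(T=400, q=10, m=5, S2=8, alpha=0.1, beta=0.3):
    x, y = rng.standard_normal(d1), np.zeros(d2)
    v = u = None
    for t in range(T):
        if t % q == 0:                        # large-batch (here: full) refresh
            full = np.arange(n)
            v, u = grad_x(x, y, full), grad_y(x, y, full)
        x_new = x - alpha * v                 # accuracy-independent stepsize
        y, v, u = concave_maximizer(x, x_new, y, v, u, m, S2, beta)
        x = x_new
    return x, y

x_hat, y_hat = sreda_boost()
# grad Phi(x) = grad_x f(x, y*(x)) with y*(x) = A^T x / mu for this toy f
g = grad_x(x_hat, (A.T @ x_hat) / mu, np.arange(n))
print(np.linalg.norm(g))                      # small: near-stationary point
```

The key structural points of the algorithm are visible here: the periodic large-batch refresh, the recursive small-batch corrections of both estimators inside the inner loop even though x is frozen there, and the constant stepsize for x.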

4. CONVERGENCE ANALYSIS OF SREDA-BOOST

The following theorem characterizes the computational complexity of SREDA-Boost for finding a first-order stationary point of Φ(·) with accuracy ε.

Theorem 1. Apply SREDA-Boost to solve the online case of the problem eq. (1). Suppose Assumptions 1-4 hold. Let ζ = κ^{-1} ε, α = O(κ^{-1} ℓ^{-1}), β = O(ℓ^{-1}), q = O(ε^{-1}), m = O(κ), S_1 = O(σ^2 κ^2 ε^{-2}) and S_2 = O(κ ε^{-1}). Then for T at least of the order O(κ ε^{-2}), Algorithm 1 outputs x̂ satisfying E[‖∇Φ(x̂)‖_2] ≤ ε with stochastic gradient complexity O(κ^3 ε^{-3}).

Furthermore, SREDA-Boost is also applicable to the finite-sum case of the problem eq. (1), by replacing the large batch S_1 of samples used in line 6 of Algorithm 1 with the full set of samples.

Corollary 1. Apply SREDA-Boost described above to solve the finite-sum case of the problem eq. (1). Suppose Assumptions 1-4 hold. Under the parameter settings given in Appendix B.4, the overall gradient complexity to attain an ε-stationary point is O(κ^2 √n ε^{-2} + n + (n + κ) log(κ)) for n ≥ κ^2, and O((κ^2 + κn) ε^{-2}) for n ≤ κ^2.

To compare with SREDA: as shown in Luo et al. (2020), SREDA requires a stringent initialization and stepsize selection to achieve the optimal complexity performance. In contrast, Theorem 1 and Corollary 1 show that these strict requirements are not necessary for achieving the optimal performance, and establish that SREDA-Boost achieves the same optimal complexity as SREDA under a much more relaxed initialization and a much larger, accuracy-independent stepsize α. The convergence analysis of SREDA-Boost in Theorem 1 is very different from the proof for SREDA in Luo et al. (2020).
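To make the scalings in Theorem 1 concrete, a small helper of our own (the constant c and the floor-plus-one rounding are arbitrary; only the orders come from the theorem) can map (κ, ε, σ, ℓ) to the prescribed parameter choices:

```python
def sreda_boost_params(kappa, eps, sigma, ell=1.0, c=1.0):
    """Order-level parameter choices from Theorem 1 (constants are arbitrary)."""
    return {
        "zeta": c * eps / kappa,                                # init accuracy
        "alpha": c / (kappa * ell),                             # eps-independent stepsize
        "beta": c / ell,                                        # inner-loop stepsize
        "q": int(c / eps) + 1,                                  # epoch length
        "m": int(c * kappa) + 1,                                # inner-loop length
        "S1": int(c * sigma ** 2 * kappa ** 2 / eps ** 2) + 1,  # large batch
        "S2": int(c * kappa / eps) + 1,                         # small batch
        "T": int(c * kappa / eps ** 2) + 1,                     # outer iterations
    }

p = sreda_boost_params(kappa=10.0, eps=0.01, sigma=1.0)
# total gradient cost ~ (T/q) * S1 + T * m * S2 = O(kappa^3 eps^{-3})
total = (p["T"] / p["q"]) * p["S1"] + p["T"] * p["m"] * p["S2"]
print(f"{total:.2e}")
```

Note how both terms of the total cost, the periodic large-batch refreshes and the recursive small-batch updates, balance at the O(κ^3 ε^{-3}) order.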
At a high level, the analysis focuses on bounding two inter-related errors: the tracking error δ_t = E[‖∇_y f(x_t, y_t)‖_2^2], which captures how well the output y_t of the inner loop approximates the optimal point y*(x_t) for a given x_t, and the gradient estimation error Δ_t = E[‖v_t - ∇_x f(x_t, y_t)‖_2^2 + ‖u_t - ∇_y f(x_t, y_t)‖_2^2], which captures how well the stochastic gradient estimators approximate the true gradients. In the analysis of SREDA in Luo et al. (2020), the stringent requirements on the initialization and stepsize, together with the ε-level normalized gradient descent update, substantially help to bound both errors δ_t and Δ_t separately at the ε-level for each iteration, so that the convergence bound follows. In contrast, this approach is not applicable to SREDA-Boost, which has a relaxed initialization and an accuracy-independent stepsize. Hence, we develop a novel analysis framework that bounds the accumulative errors Σ_{t=0}^{T-1} δ_t and Σ_{t=0}^{T-1} Δ_t over the entire algorithm execution, then decouples these two inter-related stochastic error processes and establishes their relationships with the accumulative gradient estimators Σ_{t=0}^{T-1} E[‖v_t‖_2^2]. The proof sketch of Theorem 1 further illustrates these ideas. The analysis of SREDA-Boost for min-max problems is inspired by that of SpiderBoost Wang et al. (2019) for minimization problems, but the analysis here is much more challenging due to the complicated mathematical nature of min-max optimization. Specifically, SpiderBoost needs to handle only one type of gradient estimation error, whereas SREDA-Boost must handle two strongly coupled errors in min-max problems. Hence, the novelty in analyzing SREDA-Boost mainly lies in bounding and decoupling the two errors in order to characterize their impact on the convergence bound.
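Schematically, the decoupling argument described above can be illustrated with generic coupled recursions (the contraction factor ρ and constants c_1, ..., c_4 below are placeholders of our own, not the paper's exact lemma constants):

```latex
\delta_{t+1} \le (1-\rho)\,\delta_t + c_1\,\Delta_t + c_2\,\alpha^2\,\mathbb{E}\|v_t\|_2^2,
\qquad
\Delta_{t+1} \le \Delta_t + c_3\,\alpha^2\,\mathbb{E}\|v_t\|_2^2,
```

where Δ_t is reset by the large-batch estimate at the start of each epoch of length q. Summing the second recursion within an epoch gives Σ_{t=0}^{T-1} Δ_t ≤ c_4 q α^2 Σ_{t=0}^{T-1} E‖v_t‖_2^2 (up to the large-batch variance), and substituting this bound into the summed first recursion yields ρ Σ_{t=0}^{T-1} δ_t ≤ δ_0 + (c_1 c_4 q + c_2) α^2 Σ_{t=0}^{T-1} E‖v_t‖_2^2. Both accumulative errors are thus decoupled and controlled solely by the accumulative gradient estimators, which are in turn bounded through the descent of Φ.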

5. ZO-SREDA-BOOST AND CONVERGENCE ANALYSIS

In this section, we study the min-max problem when gradient information is not available and only function values can be used for designing algorithms. Based on the first-order SREDA-Boost algorithm, we first propose the zeroth-order variance-reduced algorithm, called ZO-SREDA-Boost, and then provide the convergence analysis for this algorithm.

We compare our proposed zeroth-order variance reduction algorithm ZO-SREDA-Boost with the other existing zeroth-order stochastic algorithms and demonstrate the superior performance of ZO-SREDA-Boost. Our experiments solve a distributionally robust optimization problem, which is commonly used for studying min-max optimization Lin et al. (2019); Rafique et al. (2018). We conduct the experiments on three datasets from LIBSVM Chang & Lin (2011). The details of the problem and the datasets are provided in Appendix A.

Comparison between SREDA-Boost and SREDA: We set ε = 0.001 for both algorithms. For SREDA, we set α_t = min{ε/‖v_t‖_2, 0.005} as specified by the algorithm, and for SREDA-Boost, we set α_t = 0.005 as the algorithm allows. It can be seen in Figure 1 that SREDA-Boost enjoys a much faster convergence speed than SREDA, owing to the larger stepsize.

Comparison among zeroth-order algorithms: We compare the performance of our proposed ZO-SREDA-Boost with that of two existing stochastic algorithms, ZO-SGDA Wang et al. (2020) and ZO-SGDMSA Wang et al. (2020), designed for nonconvex-strongly-concave min-max problems. For ZO-SGDA and ZO-SGDMSA, as suggested by the corresponding theorems, we set the mini-batch sizes B = C d_1/ε^2 and B = C d_2/ε^2 for updating the variables x and y, respectively. For ZO-SREDA-Boost, based on our theory, we set the mini-batch sizes B = C d_1/ε and B = C d_2/ε for updating x and y, and set S_1 = n for the large batch, where n is the number of data samples in the dataset. We set C = 0.1 and ε = 0.1 for all algorithms. We further set the stepsize η = 0.01 for ZO-SREDA-Boost and ZO-SGDMSA. Since ZO-SGDA is a two-time-scale algorithm, we set η = 0.01 as the stepsize for the fast time scale and η/κ^3 as the stepsize for the slow time scale (based on the theory), where κ^3 = 10. It can be seen in Figure 2 that ZO-SREDA-Boost substantially outperforms the other two algorithms in terms of function query complexity (i.e., running time).

7. CONCLUSION

In this work, we have proposed enhanced variance reduction algorithms, called SREDA-Boost and ZO-SREDA-Boost, for solving nonconvex-strongly-concave min-max problems. Specifically, SREDA-Boost requires less initialization effort and allows a large stepsize. Moreover, SREDA-Boost and ZO-SREDA-Boost achieve the best known complexity dependence on the target accuracy ε within their respective classes of algorithms. We have also developed a novel analysis framework to characterize the convergence and computational complexity of such variance reduction algorithms. We expect this framework to be useful for studying various other stochastic min-max problems, such as proximal, momentum, and manifold optimization.



The condition number κ = ℓ/µ, where µ is the strong-concavity parameter of f(x, ·) and ℓ is the Lipschitz constant of the gradient of f(x, y), as defined in Assumption 2. Typically, κ is much larger than one.



Figure 1: Comparison of the convergence rate between SREDA-Boost and SREDA.

Figure 2: Comparison of function query complexity among three algorithms.

Table 1: Comparison of stochastic algorithms for nonconvex-strongly-concave min-max problems.

5.1. ZO-SREDA-BOOST ALGORITHM

The ZO-SREDA-Boost algorithm (see Algorithm 4 in Appendix C.1) shares the same update scheme as SREDA-Boost, but makes the following changes. (1) In line 3 of SREDA-Boost, instead of using iSARAH, ZO-SREDA-Boost utilizes a zeroth-order algorithm, ZO-iSARAH (Algorithm 6 in Appendix C.4), to search for an initialization y_0. (2) At the beginning of each epoch in the outer loop (line 6 of SREDA-Boost), ZO-SREDA-Boost utilizes coordinate-wise gradient estimators with a large batch S_1, which take central differences of the form [F(x + δe_j, y; ξ) - F(x - δe_j, y; ξ)]/(2δ) along each coordinate, where e_j denotes the j-th canonical unit basis vector. Note that the coordinate-wise gradient estimator is commonly used in zeroth-order variance reduced algorithms, such as in Ji et al. (2019) for minimization problems. (3) ZO-SREDA-Boost replaces ConcaveMaximizer (line 12 of SREDA-Boost) with ZO-ConcaveMaximizer (see Algorithm 5), in which the zeroth-order gradient estimators are recursively updated with small batches S_{2,x} (for the update of x) and S_{2,y} (for the update of y) based on Gaussian smoothing estimators, which perturb x and y along random directions drawn from N(0, 1_{d_1}) and N(0, 1_{d_2}), with 1_d denoting the identity matrix of size d × d.
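The two estimator types can be sketched as follows on a toy function of our own choosing (the smoothing and batch parameters here are arbitrary illustrations, not the theoretically prescribed ones):

```python
import numpy as np

rng = np.random.default_rng(2)

def coord_grad(fn, x, delta=1e-4):
    # coordinate-wise central-difference estimator: 2*d function queries
    d = x.size
    g = np.zeros(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = delta
        g[j] = (fn(x + e) - fn(x - e)) / (2 * delta)
    return g

def gauss_grad(fn, x, mu=1e-4, batch=20000):
    # Gaussian smoothing estimator: unbiased for the gradient of the
    # smoothed function; 2 queries per sampled direction u ~ N(0, I_d)
    g = np.zeros(x.size)
    for _ in range(batch):
        u = rng.standard_normal(x.size)
        g += (fn(x + mu * u) - fn(x)) / mu * u
    return g / batch

fn = lambda x: np.sin(x[0]) + x[1] ** 2          # toy smooth function
x0 = np.array([0.3, -0.7])
true_grad = np.array([np.cos(0.3), 2 * (-0.7)])
print(np.linalg.norm(coord_grad(fn, x0) - true_grad))   # O(delta^2): tiny
print(np.linalg.norm(gauss_grad(fn, x0) - true_grad))   # shrinks with batch
```

The deterministic coordinate-wise estimator is accurate but costs 2d queries, which is why it is reserved for the infrequent large-batch refreshes, while the cheap randomized Gaussian estimator drives the recursive inner-loop updates.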

5.2. CONVERGENCE ANALYSIS OF ZO-SREDA-BOOST

The following theorem provides the query complexity of ZO-SREDA-Boost for finding a first-order stationary point of Φ(·) with accuracy ε.

Theorem 2. Apply ZO-SREDA-Boost in Algorithm 4 to solve the online case of the problem eq. (1). Suppose Assumptions 1-4 hold, and let the parameters, including the smoothing parameter µ_2 = O(d_2^{-1.5} κ^{-2.5} ℓ^{-1}), be chosen as specified in Appendix C. Then for T at least of the order O(κ ε^{-2}), Algorithm 4 outputs x̂ satisfying E[‖∇Φ(x̂)‖_2] ≤ ε with overall function query complexity O((d_1 + d_2) κ^3 ε^{-3}).

Furthermore, ZO-SREDA-Boost is also applicable to the finite-sum case of the problem eq. (1), by replacing the large batch S_1 of samples used in line 6 of Algorithm 4 with the full set of samples.

Corollary 2. Apply ZO-SREDA-Boost described above to solve the finite-sum case of the problem eq. (1). Suppose Assumptions 1-4 hold. Under the parameter settings given in Appendix C.6, the function query complexity to attain an ε-stationary point is O((d_1 + d_2)(κ^2 √n ε^{-2} + n) + d_2 (κ^2 + κn) log(κ)) for n ≥ κ^2, and O((d_1 + d_2)(κ^2 + κn) ε^{-2}) for n ≤ κ^2.

Theorem 2 and Corollary 2 provide the first convergence analysis and query complexity for variance-reduced zeroth-order algorithms for min-max optimization. These two results indicate that the query complexity of ZO-SREDA-Boost matches the optimal dependence on ε of the first-order algorithm SREDA-Boost in Theorem 1 and Corollary 1. The dependence on d_1 and d_2 typically arises in zeroth-order algorithms due to the estimation of gradients with dimensions d_1 and d_2. Furthermore, in the online case, ZO-SREDA-Boost outperforms the best known query complexity dependence on ε among the existing zeroth-order algorithms by a factor of O(1/ε). Taking the condition number κ into consideration, ZO-SREDA-Boost outperforms the best known query complexity, achieved by ZO-SGDMSA, in the regime ε ≤ κ^{-1} (see Table 1).
Furthermore, Corollary 2 provides the first query complexity characterization for finite-sum zeroth-order min-max problems. As a by-product, our analysis of ZO-SREDA-Boost also yields the convergence rate and query complexity (see Lemma 21) of ZO-iSARAH for the conventional minimization problem, which provides the first complexity result for a zeroth-order recursive variance reduced algorithm of the SARAH/SPIDER type for strongly convex optimization (see Appendix C.4 for details).

6. EXPERIMENTS

Our experiments focus on two types of comparisons. First, we compare SREDA-Boost with SREDA to demonstrate the practical advantage of SREDA-Boost. Second, we compare our proposed zeroth-order variance reduction algorithm ZO-SREDA-Boost with the existing zeroth-order stochastic algorithms.

