ENHANCED FIRST AND ZEROTH ORDER VARIANCE REDUCED ALGORITHMS FOR MIN-MAX OPTIMIZATION

Anonymous

Abstract

Min-max optimization captures many important machine learning problems, such as robust adversarial learning and inverse reinforcement learning, and nonconvex-strongly-concave min-max optimization has been an active line of research. Specifically, a novel variance reduction algorithm, SREDA, was proposed recently by Luo et al. (2020) to solve such a problem, and was shown to achieve the optimal complexity dependence on the required accuracy level $\epsilon$. Despite its superior theoretical performance, the convergence guarantee of SREDA requires a stringent initialization accuracy and an $\epsilon$-dependent stepsize for controlling the per-iteration progress, so that SREDA can run very slowly in practice. This paper develops a novel analytical framework that guarantees SREDA's optimal complexity performance for a much enhanced algorithm, SREDA-Boost, which has a less restrictive initialization requirement and an accuracy-independent (and much larger) stepsize. Hence, SREDA-Boost runs substantially faster than SREDA in experiments. We further apply SREDA-Boost to propose a zeroth-order variance reduction algorithm, named ZO-SREDA-Boost, for the scenario in which only function values, but not gradients, are accessible, and show that ZO-SREDA-Boost outperforms the best known complexity dependence on $\epsilon$. This is the first study that applies the variance reduction technique to zeroth-order algorithms for min-max optimization problems.

1. INTRODUCTION

Min-max optimization has attracted significant attention in machine learning, as it captures several important machine learning models and problems, including generative adversarial networks (GANs) Goodfellow et al. (2014), robust adversarial machine learning Madry et al. (2018), imitation learning Ho & Ermon (2016), etc. Min-max optimization typically takes the following form:

$$\min_{x\in\mathbb{R}^{d_1}} \max_{y\in\mathbb{R}^{d_2}} f(x,y), \qquad f(x,y) = \begin{cases} \mathbb{E}[F(x,y;\xi)] & \text{(online case)} \\ \frac{1}{n}\sum_{i=1}^{n} F(x,y;\xi_i) & \text{(finite-sum case)} \end{cases} \qquad (1)$$

where $f(x,y)$ takes the expectation form if data samples $\xi$ are drawn in an online fashion, and $f(x,y)$ takes the finite-sum form if a dataset of training samples $\xi_i$ for $i=1,\ldots,n$ is given in advance. This paper focuses on the nonconvex-strongly-concave min-max problem, in which $f(x,y)$ is nonconvex with respect to $x$ for all $y\in\mathbb{R}^{d_2}$, and $f(x,y)$ is $\mu$-strongly concave with respect to $y$ for all $x\in\mathbb{R}^{d_1}$. The problem then takes the following equivalent form:

$$\min_{x\in\mathbb{R}^{d_1}} \Phi(x) := \max_{y\in\mathbb{R}^{d_2}} f(x,y). \qquad (2)$$

The objective function $\Phi(\cdot)$ in eq. (2) is nonconvex in general, and hence algorithms for solving eq. (2) are expected to attain an approximate (i.e., $\epsilon$-accurate) first-order stationary point. The convergence of deterministic algorithms for solving eq. (2) has been established in Jin et al. (2019); Nouiehed et al. (2019); Thekumparampil et al. (2019); Lu et al. (2020). SGD-type stochastic algorithms have also been proposed to solve such a problem more efficiently, including SGDmax Jin et al. (2019), PGSMD Rafique et al. (2018), and SGDA Lin et al. (2019), which respectively achieve the overall complexity of $O(\kappa^3\epsilon^{-4}\log(1/\epsilon))$¹, $O(\kappa^3\epsilon^{-4})$, and $O(\kappa^3\epsilon^{-4})$. Furthermore, several variance reduction methods have been proposed for solving eq. (2) in the nonconvex-strongly-concave case. PGSVRG Rafique et al. (2018) adopts a proximally guided SVRG method and achieves the overall complexity of $O(\kappa^3\epsilon^{-4})$ for the online case and $O(\kappa n\epsilon^{-2})$ for the finite-sum case. Wai et al. (2019) converted the value function evaluation problem to a specific min-max problem and applied SAGA to achieve the overall complexity of $O(\kappa n\epsilon^{-2})$ for the finite-sum case. More recently, Luo et al. (2020) proposed a novel nested-loop algorithm named Stochastic Recursive Gradient Descent Ascent (SREDA), which adopts the SARAH/SPIDER-type recursive variance reduction method Nguyen et al. (2017a); Fang et al. (2018) (originally designed for solving minimization problems) to construct gradient estimators for updating both $x$ and $y$.
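To make the recursive variance reduction idea concrete, the following is a minimal sketch of a SARAH/SPIDER-type estimator in the simpler minimization setting: a large-batch gradient is computed periodically, and in between the estimate is updated recursively with gradient differences evaluated on small fresh minibatches. The toy quadratic loss, batch sizes, and stepsize are our own illustrative choices, not the exact configuration used inside SREDA.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))      # toy samples xi_i (illustrative)
w_star = data.mean(axis=0)             # minimizer of the toy finite-sum loss

def stoch_grad(w, batch):
    # Gradient of F(w; xi) = 0.5*||w - xi||^2, averaged over a minibatch.
    return w - batch.mean(axis=0)

w = np.zeros(5)
w_prev = w.copy()
eta, q, S2 = 0.3, 10, 32               # stepsize, epoch length, small batch
for t in range(100):
    if t % q == 0:
        # Periodic restart: large-batch (here, full) gradient.
        v = stoch_grad(w, data)
    else:
        # Recursive SARAH/SPIDER correction on a fresh minibatch, using the
        # SAME samples at w_t and w_{t-1}:
        #   v_t = v_{t-1} + grad(w_t; batch) - grad(w_{t-1}; batch)
        batch = data[rng.integers(0, len(data), size=S2)]
        v = v + stoch_grad(w, batch) - stoch_grad(w_prev, batch)
    w_prev = w.copy()
    w = w - eta * v                    # descent step with the estimate v_t
```

Because the correction terms are gradient *differences* between consecutive iterates, their variance shrinks as the iterates move more slowly, which is the mechanism behind the improved complexity of this family of estimators.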
Specifically, $x$ takes a normalized gradient update in the outer loop, and each update of $x$ is followed by an entire inner loop of updates of $y$. Luo et al. (2020) showed that SREDA achieves an overall complexity of $O(\kappa^3\epsilon^{-3})$ for the online case in eq. (1), which attains the optimal dependence on $\epsilon$ Arjevani et al. (2019). For the finite-sum case, SREDA achieves the complexity of $O(\kappa^2\sqrt{n}\,\epsilon^{-2} + n + (n+\kappa)\log(\kappa/\epsilon))$ for $n \geq \kappa^2$, and $O((\kappa^2 + \kappa n)\epsilon^{-2})$ for $n \leq \kappa^2$.

Although SREDA achieves the optimal complexity performance in theory, two issues may substantially degrade its practical performance. (1) SREDA has a stringent requirement on the initialization accuracy $\zeta = \kappa^{-2}\epsilon^2$, which hence requires $O(\kappa^2\epsilon^{-2}\log(\kappa/\epsilon))$ gradient estimations in the initialization, and is rather costly in the high-accuracy regime (i.e., for small $\epsilon$). (2) The convergence of SREDA requires the stepsize to be substantially small, i.e., at the $\epsilon$-level with $\alpha_t = O(\min\{\epsilon/(\kappa\|v_t\|_2), 1/(\kappa\ell)\})$, which restricts each iteration to make only $\epsilon$-level progress with $\|x_{t+1}-x_t\|_2 = O(\epsilon/\kappa)$. Consequently, SREDA can run very slowly in practice.

• Thus, a vital question arising here is whether we can guarantee the same optimal complexity performance of SREDA even if we considerably relax its initialization (i.e., to much larger than $O(\epsilon^2)$) and enlarge its stepsize (i.e., to much larger than $O(\epsilon)$). The answer is highly nontrivial, because the original analysis framework for SREDA in Luo et al. (2020) critically relies on these restrictions. The first focus of this paper is on developing a novel analytical framework to guarantee that such an enhanced SREDA retains SREDA's optimal complexity performance.

Furthermore, in many machine learning scenarios, min-max optimization problems need to be solved without access to gradient information, but only to function values, e.g., in multi-agent reinforcement learning with bandit feedback Wei et al. (2017); Zhang et al. (2019), and in robotics Wang & Jegelka (2017); Bogunovic et al. (2018).
This motivates the design of zeroth-order (i.e., gradient-free) algorithms. For nonconvex-strongly-concave min-max optimization, Liu et al. (2019) studied a constrained problem and proposed the ZO-min-max algorithm, which achieves a computational complexity of $O((d_1+d_2)\epsilon^{-6})$. Wang et al. (2020) designed ZO-SGDA and ZO-SGDMSA, where ZO-SGDMSA achieves the best known query complexity of $O((d_1+d_2)\kappa^2\epsilon^{-4}\log(1/\epsilon))$ among the zeroth-order algorithms for this problem. All of the above studies are of SGD-type, and no effort has been made to develop variance-reduced zeroth-order algorithms for nonconvex-strongly-concave min-max optimization to further improve the query complexity.

• The second focus of this paper is on applying the aforementioned enhanced SREDA algorithm to design a zeroth-order variance-reduced algorithm for nonconvex-strongly-concave min-max problems, and on characterizing its complexity guarantee, which we anticipate to be orderwise better than that of the existing stochastic algorithms.
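For intuition on the zeroth-order setting discussed above, here is a standard two-point Gaussian-smoothing gradient estimator, which approximates a gradient using only function-value queries. This is a generic sketch; the cited algorithms each use their own estimator variants and query schedules, and the objective and parameters below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sum(x ** 2)          # toy smooth objective (illustrative)

def zo_grad(f, x, mu=1e-4, num_dirs=2000):
    # Two-point Gaussian-smoothing estimator:
    #   g = mean_u [ (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u ],  u ~ N(0, I).
    # Each direction costs two function queries and no gradient access.
    g = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.normal(size=x.shape[0])
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / num_dirs

x = np.array([1.0, -2.0, 0.5])
g_hat = zo_grad(f, x)              # approximates the true gradient 2*x
```

The estimator is unbiased for the smoothed surrogate of $f$, and its per-query variance scales with the dimension, which is why the zeroth-order complexities above carry a $(d_1+d_2)$ factor.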

1.1. MAIN CONTRIBUTIONS

This paper first studies an enhanced SREDA algorithm, called SREDA-Boost, which improves SREDA in two aspects. (1) For the initialization, SREDA-Boost requires only an accuracy of $\zeta = \kappa^{-1}\epsilon$, which is much less stringent than the $\zeta = \kappa^{-2}\epsilon^2$ required by SREDA. (2) SREDA-Boost allows an accuracy-independent stepsize $\alpha = O(1/(\kappa\ell))$, which is much larger than the $\epsilon$-level stepsize $\alpha_t = O(\min\{\epsilon/(\kappa\|v_t\|_2), 1/(\kappa\ell)\})$ adopted by SREDA. Hence, SREDA-Boost can run much faster than SREDA.
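The two relaxations can be illustrated numerically. In the sketch below, all constants, the toy strongly concave inner problem, and the gradient-estimate norm are our own illustrative choices: part (1) runs plain gradient ascent until the inner gradient norm falls below each initialization target $\zeta$, and part (2) evaluates both stepsize rules, showing that SREDA's rule caps the per-iteration movement at the $\epsilon/\kappa$ level while the accuracy-independent rule does not.

```python
import numpy as np

eps, kappa, ell, mu = 1e-3, 10.0, 1.0, 0.1    # illustrative constants

# (1) Initialization accuracy: gradient ascent on a toy strongly concave
#     inner problem f(x0, y) = -0.5*y^T A y + b^T y, mu <= eig(A) <= ell.
rng = np.random.default_rng(2)
A = np.diag(np.linspace(mu, ell, 10))
b = rng.normal(size=10)

def ascent_steps_to(tol, eta=1.0 / ell):
    y, steps = np.zeros(10), 0
    while np.linalg.norm(b - A @ y) > tol:    # ||grad_y f(x0, y)||
        y = y + eta * (b - A @ y)             # gradient ascent step
        steps += 1
    return steps

steps_sreda = ascent_steps_to(eps**2 / kappa**2)  # zeta = kappa^{-2} eps^2
steps_boost = ascent_steps_to(eps / kappa)        # zeta = kappa^{-1} eps

# (2) Stepsize rules: per-iteration movement ||x_{t+1} - x_t|| = alpha*||v_t||.
v_norm = 1.0                                      # hypothetical ||v_t||_2
alpha_sreda = min(eps / (kappa * v_norm), 1.0 / (kappa * ell))
alpha_boost = 1.0 / (kappa * ell)
move_sreda = alpha_sreda * v_norm                 # eps/kappa level
move_boost = alpha_boost * v_norm                 # independent of eps
```

The looser target $\zeta = \kappa^{-1}\epsilon$ is reached in markedly fewer inner iterations, and the accuracy-independent stepsize permits per-iteration progress that does not shrink with $\epsilon$.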



¹The condition number $\kappa = \ell/\mu$, where $\mu$ is the strong concavity parameter of $f(x,\cdot)$ and $\ell$ is the Lipschitz constant of the gradient of $f(x,y)$, as defined in our assumptions. Typically, $\kappa$ is much larger than one.






