SGDA WITH SHUFFLING: FASTER CONVERGENCE FOR NONCONVEX-PŁ MINIMAX OPTIMIZATION

Abstract

Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems. Most practical implementations of SGDA randomly reshuffle the components and use them sequentially (i.e., without-replacement sampling); however, there are few theoretical results on this approach for minimax algorithms, especially outside the easier-to-analyze (strongly-)monotone setups. To narrow this gap, we study the convergence bounds of SGDA with random reshuffling (SGDA-RR) for smooth nonconvex-nonconcave objectives with Polyak-Łojasiewicz (PŁ) geometry. We analyze both simultaneous and alternating SGDA-RR for nonconvex-PŁ and primal-PŁ-PŁ objectives, and obtain convergence rates faster than those of with-replacement SGDA. Our rates extend to mini-batch SGDA-RR, recovering known rates for full-batch gradient descent-ascent (GDA). Lastly, we present a comprehensive lower bound for GDA with an arbitrary step-size ratio, which matches the full-batch upper bound for the primal-PŁ-PŁ case.

1. INTRODUCTION

A finite-sum minimax optimization problem aims to solve the following:
$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \; f(x; y) := \frac{1}{n} \sum_{i=1}^{n} f_i(x; y), \tag{1}$$
where $f_i$ denotes the $i$-th component function. In plain language, we want to minimize the average of $n$ component functions over $x$, while maximizing it over $y$ given $x$. Many important areas of modern machine learning fall within the minimax framework, including generative adversarial networks (GANs) (Goodfellow et al., 2020), adversarial attacks and robust optimization (Madry et al., 2018; Sinha et al., 2018), multi-agent reinforcement learning (MARL) (Li et al., 2019), AUC maximization (Ying et al., 2016; Liu et al., 2020; Yuan et al., 2021), and many more. In most cases, the objective $f$ is nonconvex-nonconcave, i.e., neither convex in $x$ nor concave in $y$. Since general nonconvex-nonconcave problems are known to be intractable, we tackle problems with some additional structure, such as smoothness and Polyak-Łojasiewicz (PŁ) condition(s). We elaborate on the detailed settings for our analysis, nonconvex-PŁ and primal-PŁ-PŁ (or, PŁ(Φ)-PŁ), in Section 2.

One of the simplest and most popular algorithms for solving problem (1) is stochastic gradient descent-ascent (SGDA), which naturally extends the idea of stochastic gradient descent (SGD) used for minimization problems. Given an initial iterate $(x_0; y_0)$, at time $t \in \mathbb{N}$, SGDA (randomly) chooses an index $i(t) \in \{1, \ldots, n\}$ and accesses the $i(t)$-th component to perform a pair of updates
$$x_t = x_{t-1} - \alpha \nabla_1 f_{i(t)}(x_{t-1}; y_{t-1}), \qquad y_t = y_{t-1} + \beta \nabla_2 f_{i(t)}(\tilde{x}; y_{t-1}),$$
where $\tilde{x} = x_{t-1}$ (simSGDA) or $\tilde{x} = x_t$ (altSGDA). Here, $\alpha > 0$ and $\beta > 0$ are the step sizes and $\nabla_j$ denotes the gradient with respect to the $j$-th argument of $f_{i(t)}$ ($j = 1, 2$). As shown in the update equations above, there are two widely used versions of SGDA: simultaneous SGDA (simSGDA) and alternating SGDA (altSGDA).
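To make the difference between the two update orders concrete, here is a minimal scalar sketch of one simSGDA step versus one altSGDA step. The component form $f_i(x; y) = a_i x^2/2 - b_i y^2/2 + c_i xy$ and all numerical constants are illustrative assumptions, not taken from the paper.

```python
# One simultaneous vs. one alternating SGDA step on a toy scalar component
# f_i(x; y) = a*x**2/2 - b*y**2/2 + c*x*y  (illustrative choice, not from the paper).

def grad_x(a, b, c, x, y):
    return a * x + c * y           # nabla_1 f_i(x; y)

def grad_y(a, b, c, x, y):
    return -b * y + c * x          # nabla_2 f_i(x; y)

def sim_sgda_step(params, comp, alpha, beta):
    """Simultaneous SGDA: the ascent step uses the OLD x."""
    x, y = params
    a, b, c = comp
    x_new = x - alpha * grad_x(a, b, c, x, y)
    y_new = y + beta * grad_y(a, b, c, x, y)       # uses x (= x_{t-1})
    return x_new, y_new

def alt_sgda_step(params, comp, alpha, beta):
    """Alternating SGDA: the ascent step uses the FRESH x."""
    x, y = params
    a, b, c = comp
    x_new = x - alpha * grad_x(a, b, c, x, y)
    y_new = y + beta * grad_y(a, b, c, x_new, y)   # uses x_new (= x_t)
    return x_new, y_new

z0 = (1.0, 1.0)
comp = (2.0, 1.0, 0.5)   # (a_i, b_i, c_i) for the sampled component i(t)
print(sim_sgda_step(z0, comp, 0.1, 0.05))
print(alt_sgda_step(z0, comp, 0.1, 0.05))
```

The two variants produce the same $x_t$ but different $y_t$ whenever the ascent gradient depends on $x$ (here, whenever $c \neq 0$).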
In such stochastic gradient methods, there are two main categories of sampling schemes for the component indices $i(t)$. One way is to sample $i(t)$ independently (in time) and uniformly at random from $\{1, \ldots, n\}$, which is called with-replacement sampling. This scheme is widely adopted in theory papers because it makes the analysis of stochastic methods tractable: the noisy gradients $\nabla f_{i(t)}$ are independent over time $t$ and are unbiased estimators of the full-batch gradient $\nabla f$. In contrast, the vast majority of practical implementations employ without-replacement sampling, indicating a huge theory-practice gap. In without-replacement sampling, each index is sampled exactly once per epoch. Perhaps the most popular such scheme is random reshuffling (RR), which uniformly randomly shuffles the order of indices at the beginning of every epoch. Unfortunately, it is well-known that without-replacement methods are much more difficult to analyze theoretically, largely because the sampled indices within an epoch are no longer independent of each other. Interestingly, for minimization problems, several recent works overcome this obstacle and show that SGD using without-replacement sampling leads to faster convergence, given that the number of epochs is large enough (Nagaraj et al., 2019; Ahn et al., 2020; Mishchenko et al., 2020; Rajput et al., 2020; Nguyen et al., 2021; Yun et al., 2021; 2022). On the other hand, for minimax problems like (1), the majority of the studies still assume with-replacement sampling and/or rely on independent unbiased gradient oracles (Nouiehed et al., 2019; Guo et al., 2020; Lin et al., 2020; Yan et al., 2020; Yang et al., 2020; Loizou et al., 2021; Beznosikov et al., 2022). There are very few results on minimax algorithms using without-replacement sampling, and even most of the existing ones take advantage of (strong) convexity (in $x$) and/or (strong) concavity (in $y$) (Das et al., 2022; Maheshwari et al., 2022; Yu et al., 2022).
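The contrast between the two sampling schemes is easy to see in code. The sketch below only generates index orders; the helper names are ours, chosen for illustration.

```python
import random

def with_replacement_order(n, num_steps, rng):
    # i.i.d. uniform sampling: every step may pick any of the n indices,
    # so an index can repeat (or be missed) within a window of n steps.
    return [rng.randrange(n) for _ in range(num_steps)]

def random_reshuffling_order(n, num_epochs, rng):
    # Random reshuffling (RR): draw a fresh uniform permutation at the
    # start of each epoch; every index appears exactly once per epoch.
    order = []
    for _ in range(num_epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order

rng = random.Random(0)
print(with_replacement_order(5, 10, rng))
print(random_reshuffling_order(5, 2, rng))
```

Within each epoch the RR indices are dependent (knowing the first $n-1$ determines the last), which is exactly what complicates the analysis relative to with-replacement sampling.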
A detailed comparative analysis of these works is conducted in Section 4. Taking all these issues into consideration, our main question is the following: Does SGDA using without-replacement component sampling provably converge fast, even on a smooth nonconvex-nonconcave objective $f$ with PŁ structures?

1.1. SUMMARY OF OUR CONTRIBUTIONS

To answer the question, we analyze the convergence of SGDA with random reshuffling (SGDA-RR, Algorithm 1). We analyze both the simultaneous and alternating versions of SGDA-RR and prove convergence theorems for the following two regimes. Here we denote the step size ratio as $r = \beta/\alpha$.
• When $-f(x; y)$ satisfies the $\mu_2$-PŁ condition in $y$ (nonconvex-PŁ) and the component functions $f_i$ are $L$-smooth, we prove that SGDA-RR with $r \gtrsim (L/\mu_2)^2$ converges to $\varepsilon$-stationarity in expectation after $\mathcal{O}\big(nrL\varepsilon^{-2} + \sqrt{n}\,r^{1.5}L\varepsilon^{-3}\big)$ gradient evaluations (Theorem 1).
• Further assuming the $\mu_1$-PŁ condition on $\Phi(\cdot) := \max_y f(\cdot; y)$ (primal-PŁ-PŁ, or PŁ(Φ)-PŁ), we prove that SGDA-RR with $r \gtrsim (L/\mu_2)^2$ converges within $\varepsilon$-accuracy in expectation after $\tilde{\mathcal{O}}\big(\tfrac{nLr}{\mu_1}\log(\varepsilon^{-1}) + \sqrt{n}\,L\,(\tfrac{r}{\mu_1})^{1.5}\varepsilon^{-1}\big)$ gradient evaluations (Theorem 2). As will be discussed in Section 4, these rates are faster than existing results on with-replacement SGDA. In fact, Theorems 1 & 2 are special cases ($b = 1$) of our extended theorems (Theorems 4 & 5 in Appendix A) that analyze mini-batch SGDA-RR with batch size $b$; by setting $b = n$, we also recover known convergence rates for full-batch gradient descent-ascent (GDA). Hence, our analysis covers the entire spectrum between vanilla SGDA-RR ($b = 1$) and GDA ($b = n$).
• Additionally, we provide complexity lower bounds for solving strongly-convex-strongly-concave (SC-SC) minimax problems using full-batch simultaneous GDA with an arbitrarily fixed step size ratio $r = \beta/\alpha$. Perhaps surprisingly, we find that the lower bound for SC-SC functions matches the convergence upper bound for the much larger class of primal-PŁ-PŁ functions when the step size ratio satisfies $r \gtrsim L^2/\mu_2^2$ (Theorem 3).
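Putting the sampling scheme and the update rule together, a hedged sketch of an SGDA-RR epoch loop (in the spirit of, but not verbatim from, Algorithm 1) on a toy quadratic minimax problem might look as follows; the component data, step sizes, and function names are illustrative assumptions.

```python
import random

def sgda_rr(components, z0, alpha, beta, num_epochs, seed=0):
    """Illustrative SGDA-RR loop: reshuffle the component order at the start
    of each epoch, then take one simultaneous SGDA step per component.
    Components are scalar pairs (a_i, b_i) defining f_i = a_i*x**2/2 - b_i*y**2/2."""
    rng = random.Random(seed)
    x, y = z0
    idx = list(range(len(components)))
    for _ in range(num_epochs):
        rng.shuffle(idx)               # fresh uniform permutation (RR)
        for i in idx:
            a, b = components[i]
            gx = a * x                 # nabla_1 f_i(x; y)
            gy = -b * y                # nabla_2 f_i(x; y)
            x, y = x - alpha * gx, y + beta * gy
    return x, y

# Positive curvature in x and negative curvature in y for every component,
# so the unique saddle point of the average objective is (0, 0).
comps = [(1.0, 1.0), (2.0, 0.5), (0.5, 2.0)]
x, y = sgda_rr(comps, z0=(1.0, -1.0), alpha=0.1, beta=0.1, num_epochs=50)
print(x, y)   # both iterates shrink toward the saddle point (0, 0)
```

On this toy instance each epoch contracts $x$ and $y$ by a fixed product of factors $(1 - \alpha a_i)$ and $(1 - \beta b_i)$, so the iterates converge geometrically, loosely mirroring the linear-rate regime of the primal-PŁ-PŁ result.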

2.1. NOTATION

In our problem (1), the domain of every $f_i$ is $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = \mathbb{R}^{d_x}$, $\mathcal{Y} = \mathbb{R}^{d_y}$, and $\mathcal{Z} = \mathbb{R}^{d}$ with $d = d_x + d_y$: we consider unconstrained problems for simplicity. We denote the Euclidean norm and the standard inner product by $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$, respectively. We often use the abbreviated notation $z = (x; y) \in \mathcal{Z}$ for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Even when $z$ or $(x; y)$ is followed by superscripts and/or subscripts, we use the symbols interchangeably; e.g., $z_i^k = (x_i^k; y_i^k)$. Note that we separate the arguments $x$ (for minimization) and $y$ (for maximization) by a semicolon (';'). We use $\nabla_1$ and $\nabla_2$ to denote the

