SGDA WITH SHUFFLING: FASTER CONVERGENCE FOR NONCONVEX-PŁ MINIMAX OPTIMIZATION

Abstract

Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems. Most practical implementations of SGDA randomly reshuffle components and sequentially use them (i.e., without-replacement sampling); however, there are few theoretical results on this approach for minimax algorithms, especially outside the easier-to-analyze (strongly-)monotone setups. To narrow this gap, we study the convergence bounds of SGDA with random reshuffling (SGDA-RR) for smooth nonconvex-nonconcave objectives with Polyak-Łojasiewicz (PŁ) geometry. We analyze both simultaneous and alternating SGDA-RR for nonconvex-PŁ and primal-PŁ-PŁ objectives, and obtain convergence rates faster than with-replacement SGDA. Our rates extend to mini-batch SGDA-RR, recovering known rates for full-batch gradient descent-ascent (GDA). Lastly, we present a comprehensive lower bound for GDA with an arbitrary step-size ratio, which matches the full-batch upper bound for the primal-PŁ-PŁ case.

1. INTRODUCTION

A finite-sum minimax optimization problem aims to solve the following: $$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x; y) := \frac{1}{n} \sum_{i=1}^{n} f_i(x; y), \tag{1}$$ where $f_i$ denotes the $i$-th component function. In plain language, we want to minimize the average of $n$ component functions with respect to $x$, while maximizing it with respect to $y$ given $x$. Many important areas in modern machine learning fall within the minimax framework, including generative adversarial networks (GANs) (Goodfellow et al., 2020), adversarial attack and robust optimization (Madry et al., 2018; Sinha et al., 2018), multi-agent reinforcement learning (MARL) (Li et al., 2019), AUC maximization (Ying et al., 2016; Liu et al., 2020; Yuan et al., 2021), and many more. In most of these cases, the objective $f$ is nonconvex-nonconcave, i.e., neither convex in $x$ nor concave in $y$. Since general nonconvex-nonconcave problems are known to be intractable, we tackle problems with some additional structure, such as smoothness and Polyak-Łojasiewicz (PŁ) condition(s). We describe the detailed settings for our analysis, nonconvex-PŁ and primal-PŁ-PŁ (or PŁ(Φ)-PŁ), in Section 2.
One of the simplest and most popular algorithms for solving problem (1) is stochastic gradient descent-ascent (SGDA), which naturally extends the idea of stochastic gradient descent (SGD) used for minimization problems. Given an initial iterate $(x_0; y_0)$, at time $t \in \mathbb{N}$, SGDA (randomly) chooses an index $i(t) \in \{1, \dots, n\}$ and accesses the $i(t)$-th component to perform a pair of updates $$x_t = x_{t-1} - \alpha \nabla_1 f_{i(t)}(x_{t-1}; y_{t-1}), \qquad y_t = y_{t-1} + \beta \nabla_2 f_{i(t)}(\tilde{x}; y_{t-1}), \quad \text{where } \tilde{x} = \begin{cases} x_{t-1}, & \text{(simSGDA)}, \\ x_t, & \text{(altSGDA)}. \end{cases}$$ Here, $\alpha > 0$ and $\beta > 0$ are the step sizes and $\nabla_j$ denotes the gradient of $f_{i(t)}$ with respect to its $j$-th argument ($j = 1, 2$). As shown in the update equations above, there are two widely used versions of SGDA: simultaneous SGDA (simSGDA) and alternating SGDA (altSGDA).
In such stochastic gradient methods, there are two main categories of sampling schemes for the component indices $i(t)$. One way is to sample $i(t)$ independently (in time) and uniformly at random from $\{1, \dots, n\}$, which is called with-replacement sampling. This scheme is widely adopted in theory papers because it makes stochastic methods amenable to analysis: the noisy gradients $\nabla f_{i(t)}$ are independent over time $t$ and are unbiased estimators of the full-batch gradient $\nabla f$. In contrast, the vast majority of practical implementations employ without-replacement sampling, indicating a huge theory-practice gap. In without-replacement sampling, we sample each index precisely once in each epoch. Perhaps the most popular of such schemes is random reshuffling (RR), which shuffles the order of indices uniformly at random at the beginning of every epoch. Unfortunately, it is well-known that without-replacement methods are much more difficult to analyze theoretically, largely because the sampled indices in each epoch are no longer independent of each other. Interestingly, for minimization problems, several recent works overcome this obstacle and show that SGD using without-replacement sampling leads to faster convergence, given that the number of epochs is large enough (Nagaraj et al., 2019; Ahn et al., 2020; Mishchenko et al., 2020; Rajput et al., 2020; Nguyen et al., 2021; Yun et al., 2021; 2022). On the other hand, for minimax problems like (1), the majority of the studies still assume with-replacement sampling and/or rely on independent unbiased gradient oracles (Nouiehed et al., 2019; Guo et al., 2020; Lin et al., 2020; Yan et al., 2020; Yang et al., 2020; Loizou et al., 2021; Beznosikov et al., 2022). There are very few results on minimax algorithms using without-replacement sampling, and most of the existing ones take advantage of (strong-)convexity (in $x$) and/or (strong-)concavity (in $y$) (Das et al., 2022; Maheshwari et al., 2022; Yu et al., 2022).
Detailed comparative analysis of these works is conducted in Section 4. Putting all these issues into consideration, our main question is the following. Does SGDA using without-replacement component sampling provably converge fast, even on smooth nonconvex-nonconcave objective f with PŁ structures?
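The two sampling schemes contrasted above can be sketched in a few lines. This is only an illustration of how the index sequences are generated; the function names and the use of NumPy are assumptions made here, not part of the paper.

```python
import numpy as np

def with_replacement_indices(n, num_epochs, rng):
    """Draw n * num_epochs indices i.i.d. uniformly from {0, ..., n-1}."""
    return rng.integers(0, n, size=n * num_epochs)

def random_reshuffling_indices(n, num_epochs, rng):
    """Concatenate one fresh uniform permutation of {0, ..., n-1} per epoch."""
    return np.concatenate([rng.permutation(n) for _ in range(num_epochs)])

rng = np.random.default_rng(0)
n, num_epochs = 5, 3
wr = with_replacement_indices(n, num_epochs, rng)
rr = random_reshuffling_indices(n, num_epochs, rng)

# Under RR, every epoch (a block of n consecutive indices) visits each
# component exactly once; with-replacement sampling gives no such guarantee.
rr_epochs = rr.reshape(num_epochs, n)
```

Within an RR epoch the indices are dependent (each one excludes those already drawn), which is the source of the analytical difficulty mentioned above.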

1.1. SUMMARY OF OUR CONTRIBUTIONS

To answer the question, we analyze the convergence of SGDA with random reshuffling (SGDA-RR, Algorithm 1). We analyze both the simultaneous and alternating versions of SGDA-RR and prove convergence theorems for the following two regimes. Here we denote the step size ratio by $r = \beta/\alpha$.
• When $-f(x; y)$ satisfies the $\mu_2$-PŁ condition in $y$ (nonconvex-PŁ) and the component functions $f_i$ are $L$-smooth, we prove that SGDA-RR with $r \gtrsim (L/\mu_2)^2$ converges to $\varepsilon$-stationarity in expectation after $O\big(nrL\varepsilon^{-2} + \sqrt{n}\, r^{1.5} L \varepsilon^{-3}\big)$ gradient evaluations (Theorem 1).
• Further assuming the $\mu_1$-PŁ condition on $\Phi(\cdot) := \max_y f(\cdot; y)$ (primal-PŁ-PŁ, or PŁ(Φ)-PŁ), we prove that SGDA-RR with $r \gtrsim (L/\mu_2)^2$ converges within $\varepsilon$-accuracy in expectation after $\tilde{O}\big(\frac{nLr}{\mu_1} \log(\varepsilon^{-1}) + \sqrt{n}\, L \big(\frac{r}{\mu_1}\big)^{1.5} \varepsilon^{-1}\big)$ gradient evaluations (Theorem 2).
As will be discussed in Section 4, the rates shown above are faster than existing results on with-replacement SGDA. In fact, Theorems 1 & 2 are special cases ($b = 1$) of our extended theorems (Theorems 4 & 5 in Appendix A) that analyze mini-batch SGDA-RR of batch size $b$; by setting $b = n$, we also recover known convergence rates for full-batch gradient descent-ascent (GDA). Hence, our analysis covers the entire spectrum between vanilla SGDA-RR ($b = 1$) and GDA ($b = n$).
• Additionally, we provide complexity lower bounds for solving strongly-convex-strongly-concave (SC-SC) minimax problems using full-batch simultaneous GDA with an arbitrarily fixed step size ratio $r = \beta/\alpha$. Perhaps surprisingly, we find that the lower bound for SC-SC functions matches the convergence upper bound for a much larger class of primal-PŁ-PŁ functions when the step size ratio satisfies $r \gtrsim L^2/\mu_2^2$ (Theorem 3).

2.1. NOTATION

In our problem (1), the domain of every $f_i$ is $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = \mathbb{R}^{d_x}$, $\mathcal{Y} = \mathbb{R}^{d_y}$, and $\mathcal{Z} = \mathbb{R}^{d}$: we consider unconstrained problems for simplicity. We denote the Euclidean norm and the standard inner product by $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$, respectively. We often use an abbreviated notation $z = (x; y) \in \mathcal{Z}$ for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Even when $z$ or $(x; y)$ is followed by superscripts and/or subscripts, we use the symbols interchangeably; e.g., $z_i^k = (x_i^k; y_i^k)$. Note that we split the arguments $x$ (for minimization) and $y$ (for maximization) by a semicolon (';'). We use $\nabla_1$ and $\nabla_2$ to denote the gradients with respect to the first and second arguments, respectively. Accordingly, we can write the full gradient as, e.g., $\nabla g = [\nabla_1 g^\top, \nabla_2 g^\top]^\top$. For a positive integer $N$, we denote $[N] := \{1, \dots, N\}$. Let $S_N$ denote the symmetric group of degree $N$. That is, each permutation $\sigma \in S_N$ is a bijection from $[N]$ to itself, or equivalently, a re-arrangement of $[N]$. Lastly, we use the usual $O/\Omega/\Theta$ notation for bounds, where $\tilde{O}/\tilde{\Omega}/\tilde{\Theta}$ hide some logarithmic factors.

Algorithm 1 simSGDA/altSGDA-RR
1: Given: The number of components $n$; the number of epochs $K$; step sizes $\alpha, \beta > 0$
2: Initialize: $(x_0^1; y_0^1) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$
3: for $k \in [K]$ do
4:     Sample $\sigma_k \sim \mathrm{Unif}(S_n)$
5:     for $i \in [n]$ do
6:         $x_i^k = x_{i-1}^k - \alpha \nabla_1 f_{\sigma_k(i)}(x_{i-1}^k; y_{i-1}^k)$
7:         if simSGDA-RR then
8:             $y_i^k = y_{i-1}^k + \beta \nabla_2 f_{\sigma_k(i)}(x_{i-1}^k; y_{i-1}^k)$    ▷ simultaneous update: x & y
9:         else if altSGDA-RR then
10:            $y_i^k = y_{i-1}^k + \beta \nabla_2 f_{\sigma_k(i)}(x_i^k; y_{i-1}^k)$    ▷ alternating update: x → y
11:    $(x_0^{k+1}; y_0^{k+1}) = (x_n^k; y_n^k)$

2.2. ALGORITHMS: SIMSGDA-RR & ALTSGDA-RR

As we explained in Section 1, we consider simSGDA and altSGDA combined with RR, a without-replacement sampling scheme. We call them simSGDA-RR and altSGDA-RR, respectively. We present a detailed description of the methods in Algorithm 1. For completeness, we also provide an extended version that uses mini-batches of size ≥ 1 (Algorithm 2) in Appendix A. For comparison, we refer to the SGDA algorithms using with-replacement sampling simply as simSGDA and altSGDA. The quantities $\alpha, \beta > 0$ are step sizes associated with $x$ and $y$, respectively. We use two separate symbols $\alpha$ and $\beta$ to allow the two step sizes to be different. Such algorithms are sometimes called two-time-scale algorithms, in a broader sense, and they are adopted in nonconvex minimax optimization problems (Heusel et al., 2017; Lin et al., 2020; Yang et al., 2020). In fact, a recent result (Li et al., 2022) shows that having $\alpha \neq \beta$ is sometimes necessary for convergence.
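The update rules of Algorithm 1 can be sketched as follows. This is a minimal illustration for scalar variables, assuming hypothetical gradient oracles `grad_x(i, x, y)` and `grad_y(i, x, y)` that return the partial derivatives of the i-th component; the toy quadratic components, step sizes, and starting point are chosen here for illustration only and do not come from the paper.

```python
import numpy as np

def sgda_rr(grad_x, grad_y, x0, y0, n, num_epochs, alpha, beta,
            alternating=False, seed=0):
    """simSGDA-RR (alternating=False) or altSGDA-RR (alternating=True)."""
    rng = np.random.default_rng(seed)
    x, y = float(x0), float(y0)
    for _ in range(num_epochs):
        sigma = rng.permutation(n)          # reshuffle once per epoch
        for i in sigma:
            x_next = x - alpha * grad_x(i, x, y)
            # simultaneous: y uses the old x; alternating: y uses the new x
            x_eval = x_next if alternating else x
            y = y + beta * grad_y(i, x_eval, y)
            x = x_next
    return x, y

# Toy components (an assumption for illustration):
# f_i(x; y) = x^2/2 + x*y - y^2/2 + u_i*x - v_i*y, with sum(u) = sum(v) = 0,
# so the average f has its saddle point at (0; 0).
u = 0.1 * np.array([1.0, -1.0, 0.5, -0.5])
v = 0.1 * np.array([0.5, -0.5, -1.0, 1.0])
grad_x = lambda i, x, y: x + y + u[i]
grad_y = lambda i, x, y: x - y - v[i]

sim = sgda_rr(grad_x, grad_y, 5.0, -5.0, n=4, num_epochs=500,
              alpha=0.01, beta=0.05, alternating=False)
alt = sgda_rr(grad_x, grad_y, 5.0, -5.0, n=4, num_epochs=500,
              alpha=0.01, beta=0.05, alternating=True)
```

The only difference between the two variants is which x-iterate is fed to the y-update, which is why a single flag suffices in this sketch.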

2.3. ASSUMPTIONS AND DEFINITIONS

To define the function classes that we are interested in solving, we introduce a few assumptions.
Assumption 1 (Component smoothness). Every $i$-th component $f_i : \mathcal{Z} \to \mathbb{R}$ is $L$-smooth, i.e., $f_i$ is differentiable and $\nabla f_i$ is $L$-Lipschitz continuous: $\|\nabla f_i(z) - \nabla f_i(\tilde{z})\| \le L \|z - \tilde{z}\|$. As a result, $f_i(\tilde{z}) - f_i(z) \le \langle \nabla f_i(z), \tilde{z} - z \rangle + \frac{L}{2} \|\tilde{z} - z\|^2$ ($\forall z, \tilde{z}$) and the average $f$ of the $f_i$'s is also $L$-smooth.
Assumption 2 (Component gradient variance). There exist constants $A, B \ge 0$ such that, for any $z = (x; y) \in \mathcal{Z}$ and $j \in \{1, 2\}$, we have $\frac{1}{n} \sum_{i=1}^{n} \|\nabla_j f_i(z) - \nabla_j f(z)\|^2 \le A \|\nabla_j f(z)\|^2 + B$.
Assumption 3. For a function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, the primal function $\Phi : \mathcal{X} \to \mathbb{R}$ is well-defined as $\Phi(x) := \max_{y' \in \mathcal{Y}} f(x; y')$. For each $x \in \mathcal{X}$, the set $\mathcal{Y}^*_x := \arg\max_{y' \in \mathcal{Y}} f(x; y')$ is non-empty and closed. Moreover, we assume $\Phi(x)$ is bounded below by $\Phi^* = \inf_{x' \in \mathcal{X}} \Phi(x') > -\infty$.
Note that Assumption 2 controls the discrepancy between the objective function $f$ and its components $f_i$; it is similar to Assumption 2 of Nguyen et al. (2021), adapted to minimax problems. Letting $A = 0$ recovers a common assumption of the uniformly bounded variance of component gradients; thus, our assumption is a relaxation. Also, note that $A = B = 0$ when $n = 1$.
We now add an additional structure to our objective function, which is called the Polyak-Łojasiewicz (PŁ) condition. A function $g : \mathbb{R}^d \to \mathbb{R}$ is said to be $\mu$-PŁ if it has a minimum value $g^*$ and satisfies $$\|\nabla g(t)\|^2 \ge 2\mu (g(t) - g^*) \quad (\forall t \in \mathbb{R}^d). \tag{$\mu$-PŁ}$$ Studies and applications involving this condition can be found in Karimi et al. (2016); Nouiehed et al. (2019); Yang et al. (2020); Liu et al. (2020), and more. Note that every $\mu$-strongly convex function satisfies the $\mu$-PŁ condition, whereas a PŁ function does not need to be convex. Hence, $\mu$-PŁ is a strict generalization of $\mu$-strong convexity. In addition, every stationary point of a PŁ function is a global optimum, which is a benign property for optimization.
We are interested in the case where our objective function $f(x; y)$ has such a structure in terms of $y$ (Assumption 4). Sometimes, we further assume the primal function $\Phi$ is also PŁ (Assumption 5). We emphasize that we do not necessarily assume the PŁ conditions for the individual $f_i$'s.
Assumption 4 ($y$-side PŁ). For each (fixed) $x \in \mathcal{X}$, $-f(x; \cdot)$ is $\mu_2$-PŁ, i.e., for every $(x; y) \in \mathcal{Z}$, $\|\nabla_2 f(x; y)\|^2 \ge 2\mu_2 (\Phi(x) - f(x; y))$, where $\Phi$ is the primal function associated with $f$.
Assumption 5 (Primal PŁ, or PŁ(Φ)). The primal function $\Phi(\cdot) = \max_{y'} f(\cdot; y')$ of $f$ is $\mu_1$-PŁ, i.e., for every $x \in \mathcal{X}$, $\|\nabla \Phi(x)\|^2 \ge 2\mu_1 (\Phi(x) - \Phi^*)$, where $\Phi^* = \min_x \Phi(x)$ is well-defined.
We say the function $f$ is nonconvex-PŁ when it satisfies Assumption 4. Since we do not assume any convexity/concavity, it is generally hard to reach global optima. Due to the $y$-side PŁ condition, we can guarantee that the primal function $\Phi$ is differentiable and even $L_\Phi$-smooth with $L_\Phi \le L + L^2/\mu_2$ (Proposition 9 in Appendix B). Since problem (1) can be reformulated as the minimization of $\Phi$ (provided that, given $x$, we can always find a $y$ that maximizes $f(x; y)$), we could aim to find an approximate first-order stationary point of $\Phi$, by making the norm of the gradient of $\Phi$ small. On top of that, if $f$ satisfies both Assumptions 4 and 5, the function is said to be primal-PŁ-PŁ, or PŁ(Φ)-PŁ for short. In this case, we directly aim not only to decrease the primal function $\Phi$ associated with the objective function $f$ but also to increase the function value $f(x; y)$ in terms of $y$. To evaluate how close we are to our goal, we define a potential function $V_\lambda$ later in Section 3. When we attain $V_\lambda(x^*; y^*) = 0$, it implies that we arrive at a global minimax point: $f(x^*; y^*) = \Phi(x^*) = \Phi^*$. The function $V_\lambda$ enables us to develop a unified analysis for nonconvex-PŁ and PŁ(Φ)-PŁ objective functions; we discuss this in greater detail in Section 3.
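To make the claim that a PŁ function need not be convex concrete, the sketch below numerically checks the (µ-PŁ) inequality on a grid for the one-dimensional function g(t) = t² + 3 sin²(t), a commonly used illustration of a nonconvex PŁ function. The constant µ = 1/32 is taken here as an assumption and is only verified on the chosen grid; this is a sanity check, not a proof.

```python
import numpy as np

def g(t):
    return t**2 + 3.0 * np.sin(t)**2

def dg(t):
    # derivative: 2t + 6 sin(t) cos(t) = 2t + 3 sin(2t)
    return 2.0 * t + 3.0 * np.sin(2.0 * t)

def d2g(t):
    # second derivative: 2 + 6 cos(2t), negative wherever cos(2t) < -1/3
    return 2.0 + 6.0 * np.cos(2.0 * t)

mu = 1.0 / 32.0       # assumed PŁ constant, checked only numerically
g_star = 0.0          # minimum value, attained at t = 0
ts = np.linspace(-10.0, 10.0, 200001)

# (µ-PŁ): |g'(t)|^2 >= 2 µ (g(t) - g*) at every grid point
pl_holds = bool(np.all(dg(ts)**2 >= 2.0 * mu * (g(ts) - g_star)))
# nonconvexity: the second derivative is negative somewhere
nonconvex = bool(np.any(d2g(ts) < 0.0))
```

On this grid both flags come out true, which is consistent with the statement that µ-PŁ strictly generalizes µ-strong convexity.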

3. MAIN RESULTS

Based on the assumptions stated in the previous section, we present the convergence results for both smooth nonconvex-PŁ objectives and smooth PŁ(Φ)-PŁ objectives. Before stating the main theorems, we first introduce the most important tool for our analyses: the potential function.

3.1. POTENTIAL FUNCTION V λ

For our convergence analyses, we utilize a function $V_\lambda : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ defined as $$V_\lambda(x; y) := \lambda(\Phi(x) - \Phi^*) + (\Phi(x) - f(x; y)), \tag{2}$$ where $\lambda > 0$ is a constant. We borrow inspiration from Yang et al. (2020) and Das et al. (2022) to come up with this function, although our placement of $\lambda$ is different. In fact, the convergence to a neighborhood of a global minimax point (if it exists) implies the reduction of this function. For each $x$, the non-negative term $\Phi(x) - f(x; y)$ gets smaller as $y$ makes $f(x; y)$ larger. The term becomes zero when $y = y^*(x)$ for some $y^*(x) \in \mathcal{Y}^*_x$, since $\Phi(x) = f(x; y^*(x))$. Also, the other non-negative term $\Phi(x) - \Phi^*$ gets smaller as $x$ makes $\Phi(x)$ smaller. Thus, as $(x; y)$ approaches a minimax optimal point, $V_\lambda(x; y)$ decreases to near zero. In general, $V_\lambda$ is not guaranteed to attain exact zero, especially when the objective function $f(x; y)$ is nonconvex in $x$ (e.g., $f$ is nonconvex-PŁ). Nevertheless, the potential function is still useful for deriving our convergence results.
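As a small worked instance of the potential function, consider the scalar quadratic f(x; y) = (a/2)x² + bxy − (c/2)y² with c > 0, for which Φ(x) = ½(a + b²/c)x² and, when a + b²/c > 0, Φ* = 0. The sketch below evaluates V_λ for this example; the specific coefficients are assumptions chosen here so that f is nonconvex in x (a < 0) while Φ remains bounded below.

```python
def potential(x, y, a=-0.5, b=1.0, c=1.0, lam=4.0):
    """V_lambda(x; y) for f(x; y) = a x^2/2 + b x y - c y^2/2 (illustrative)."""
    m = a + b * b / c            # curvature of the primal function Phi
    assert c > 0 and m > 0       # needed for Phi to exist and have Phi* = 0
    phi = 0.5 * m * x * x        # Phi(x) = max_y f(x; y), attained at y = b x / c
    phi_star = 0.0
    f = 0.5 * a * x * x + b * x * y - 0.5 * c * y * y
    return lam * (phi - phi_star) + (phi - f)

v_saddle = potential(0.0, 0.0)   # at the minimax point (0; 0)
v_best_y = potential(1.0, 1.0)   # y = y*(x) = b x / c, so the second term vanishes
v_off_y = potential(1.0, 0.0)    # y away from y*(x) adds (c/2)(y - y*(x))^2
```

For this quadratic, the second term of V_λ equals (c/2)(y − bx/c)², so the potential separates cleanly into a primal gap and a dual gap.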

3.2. MAIN THEOREMS: UPPER BOUNDS OF CONVERGENCE RATES

Now, we present our main results. We provide a detailed comparison of our theorems against existing results in Section 4. We present the full proof in Appendices C and D. We remark that both Theorems 1 and 2 are special cases (for mini-batch size $b = 1$) of their mini-batch extensions: Theorems 4 and 5 in Appendix A.
Theorem 1 (Nonconvex-PŁ). Suppose that $f$ satisfies Assumptions 1, 2, 3, and 4. Let $\kappa_2 = L/\mu_2$, where $\mu_2$ is the PŁ constant of $-f(x; \cdot)$ at all $x$. Let $\lambda = 4$. Choose the step sizes $\alpha$ and $\beta$ such that $$\beta = \min\left\{ \frac{1}{6L\sqrt{n(n + A)}},\ O\left( \left( \frac{V_\lambda(z_0^1)}{B n^2 K} \right)^{1/3} \right) \right\} \quad \text{and} \quad \alpha = \frac{\beta}{r},$$ for some $r \ge 14\kappa_2^2$. Then, both simSGDA-RR and altSGDA-RR (Algorithm 1) satisfy $$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left\|\nabla \Phi(x_0^k)\right\|^2 \le O\left( \frac{r L V_\lambda(z_0^1)}{K} \sqrt{1 + \frac{A}{n}} + r \left( \frac{L^2 B V_\lambda(z_0^1)^2}{n K^2} \right)^{1/3} \right).$$
Upper bound on gradient complexity. To achieve $\varepsilon$-stationarity of the primal function, i.e., $\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\|\nabla \Phi(x_0^k)\|^2 \le \varepsilon^2$, a sufficient number of gradient evaluations (denoted by $T_\varepsilon = nK$) is $$T_\varepsilon = O\left( \frac{r L V_\lambda(z_0^1)}{\varepsilon^2} \max\left\{ \sqrt{n^2 + nA},\ \frac{\sqrt{r n B}}{\varepsilon} \right\} \right).$$
Theorem 2 (PŁ(Φ)-PŁ). Suppose that $f$ satisfies Assumptions 1, 2, 3, 4, and 5. Let $\kappa_1 = L/\mu_1$ and $\kappa_2 = L/\mu_2$, where $\mu_1$ and $\mu_2$ are the PŁ constants of $\Phi(\cdot)$ and $-f(x; \cdot)$ (at all $x$), respectively. Let $\lambda = 4$. Choose appropriate step sizes $\alpha$ and $\beta$ such that $$\beta = \min\left\{ \frac{1}{6L\sqrt{n(n + A)}},\ \tilde{O}\left( \frac{\kappa_2^2}{\mu_1 n K} \right) \right\} \quad \text{and} \quad \alpha = \frac{\beta}{r},$$ for some $r \ge 14\kappa_2^2$. Then, both simSGDA-RR and altSGDA-RR (Algorithm 1) satisfy $$\mathbb{E}\left[V_\lambda(z_0^{K+1})\right] \le O\left( V_\lambda(z_0^1) \cdot \exp\left( -\frac{K}{12 \kappa_1 r \sqrt{1 + \frac{A}{n}}} \right) \right) + \tilde{O}\left( \frac{\kappa_1^2 r^3 B}{\mu_1 n K^2} \right).$$
Upper bound on gradient complexity. To achieve $\varepsilon^2$-accuracy in expectation of $V_\lambda(z_n^K)$, i.e., $\mathbb{E}[V_\lambda(z_n^K)] \le \varepsilon^2$, a sufficient number of gradient evaluations (denoted by $T_\varepsilon = nK$) is $$T_\varepsilon = \max\left\{ O\left( \kappa_1 r \sqrt{n^2 + nA} \cdot \log\frac{V_\lambda(z_0^1)}{\varepsilon} \right),\ \tilde{O}\left( \frac{\kappa_1 r^{3/2}}{\varepsilon} \sqrt{\frac{nB}{\mu_1}} \right) \right\}.$$
Remark on step size ratio. In both theorems, we use step sizes of ratio $r = \beta/\alpha \gtrsim \kappa_2^2$. It is common to use such a step size scheme $r = \Theta(\kappa_2^2)$ to analyze two-time-scale (S)GDA for nonconvex minimax problems (Jin et al., 2020; Lin et al., 2020; Yang et al., 2020).
Remark on the parameter λ. In our convergence analyses, we arbitrarily choose $\lambda = 4$, which makes the numerical calculations easier. The value of $\lambda > 0$ does not matter for the equivalence between the equation $V_\lambda(x^*; y^*) = 0$ and the global minimax condition (Proposition 11 in Appendix B). Also, the choice of $\lambda$ in both theorems can be arbitrary as long as $\lambda > 1$; our logic does not fall apart if other appropriate step sizes for that $\lambda$ are chosen. That is to say, we can show that the sequence $V_\lambda(z_0^k)$ almost monotonically decreases, ignoring some small variance terms.
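The step from the convergence rate of Theorem 1 to its gradient complexity can be spelled out as follows. This is a sketch that suppresses the constants hidden in $O(\cdot)$ and writes $V := V_\lambda(z_0^1)$; it uses only quantities already defined in Theorem 1.

```latex
\begin{align*}
\text{(I)}\quad & \frac{rLV}{K}\sqrt{1+\frac{A}{n}} \le \varepsilon^2
  \iff K \ge \frac{rLV}{\varepsilon^2}\sqrt{1+\frac{A}{n}}
  \iff nK \ge \frac{rLV}{\varepsilon^2}\sqrt{n^2+nA}, \\
\text{(II)}\quad & r\left(\frac{L^2 B V^2}{nK^2}\right)^{1/3} \le \varepsilon^2
  \iff K^2 \ge \frac{r^3 L^2 B V^2}{n\,\varepsilon^6}
  \iff nK \ge \frac{r^{3/2} L V \sqrt{nB}}{\varepsilon^3}
  = \frac{rLV}{\varepsilon^2}\cdot\frac{\sqrt{rnB}}{\varepsilon}.
\end{align*}
% Requiring both (I) and (II) gives, up to constants,
% T_eps = nK = O( (rLV / eps^2) * max{ sqrt(n^2 + nA), sqrt(rnB) / eps } ).
```

The second term dominates for small $\varepsilon$, which is where the $\varepsilon^{-3}$ dependence of SGDA-RR comes from.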

4.1. COMPARISON WITH STOCHASTIC WITH-REPLACEMENT SETTING

First of all, we confirm that SGDA with random reshuffling (RR) has faster convergence rates (i.e., fewer gradient computations) than SGDA based on with-replacement sampling. In particular, we compare our results with the analyses on the purely stochastic minimax settings which assume that every stochastic gradient oracle is independently sampled and unbiased: this assumption is naturally satisfied by with-replacement sampling for the finite-sum settings we consider. To make the comparisons fair and easy, we simply let $r = \beta/\alpha = \Theta(\kappa_2^2)$, $A = 0$, and $B = \tau^2$.
Lin et al. (2020, Theorem 4.5) present a convergence rate for with-replacement simSGDA with $r = \Theta(\kappa_2^2)$ run on nonconvex-$\mu_2$-strongly-concave problems with a convex bounded constraint set $\mathcal{Y}$ for the dual variable $y$. Their gradient complexity to achieve $\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\|\nabla \Phi(x_t)\|^2 \le \varepsilon^2$ (where $T$ is the number of iterations) is written as $$T_\varepsilon = O\left( \frac{\kappa_2^2 L \Delta_\Phi + \kappa_2 L^2 D^2}{\varepsilon^2} \max\left\{ 1,\ \frac{\kappa_2 \tau^2}{\varepsilon^2} \right\} \right),$$ where $\kappa_2 = L/\mu_2$, $\Delta_\Phi = \Phi(x_0) - \Phi^*$, $D = \mathrm{diam}\, \mathcal{Y}$, and $\tau^2$ is the variance of the (unbiased) stochastic gradient oracles. Their complexity can be simplified as $O(\kappa_2^3 \tau^2 \varepsilon^{-4})$, treating other factors as constants. In contrast, our Theorem 1 has a better gradient complexity in terms of $\varepsilon$ and $\tau$, thanks to shuffling: $$O\left( \frac{\kappa_2^2 L V_\lambda(z_0^1)}{\varepsilon^2} \max\left\{ n,\ \frac{\kappa_2 \tau \sqrt{n}}{\varepsilon} \right\} \right), \quad \text{(Ours, from Theorem 1)}$$ or simply $O(\kappa_2^3 \tau \sqrt{n}\, \varepsilon^{-3})$. Thus, our gradient complexity for both simSGDA-RR and altSGDA-RR is better than that of with-replacement simSGDA when $\varepsilon$ is as small as $\varepsilon \le O(\tau/\sqrt{n})$. Our rate has three more strengths: (i) we do not require strong concavity in $y$, which is a strictly stronger assumption than the $y$-side PŁ condition; (ii) we do not require the constraint set $\mathcal{Y}$ to be bounded; (iii) our result can easily extend to the case of any mini-batch sizes, whereas Lin et al. (2020) need a particular choice of mini-batch size $M = O(\kappa_2 \tau^2/\varepsilon)$ to ensure convergence. For nonconvex-PŁ objectives, Yang et al.
(2022, Theorem 3.1) provide a convergence rate for with-replacement altSGDA with $r = \Theta(\kappa_2^2)$. Their rate can be translated to a gradient complexity for achieving $\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\|\nabla \Phi(x_t)\|^2 \le \varepsilon^2$, written as $O\left( \frac{\kappa_2^2 L V_\lambda(z_0)}{\varepsilon^2} \left( 1 + \frac{\kappa_2^2 V_\lambda(z_0)^2 \tau^2}{\Delta_\Phi \varepsilon^2} \right) \right)$ or simply $O(\kappa_2^4 \tau^2 \varepsilon^{-4})$. Therefore, our gradient complexity for both altSGDA-RR and simSGDA-RR is better when $\varepsilon$ is as small as $\varepsilon \le O(\kappa_2 \tau/\sqrt{n})$.
For PŁ(Φ)-PŁ objectives, Yang et al. (2020, Theorem 3.3) give a gradient complexity for with-replacement SGDA with diminishing step sizes to achieve $\mathbb{E}[V_\lambda(z_T)] \le \varepsilon^2$. One can apply constant step sizes depending on the total number $T$ of iterations to their analysis and derive a similar complexity that deteriorates only by a logarithmic factor. In contrast, our gradient complexity for both sim/altSGDA-RR using constant step sizes can be written as, for small enough $\varepsilon$, $$\tilde{O}\left( \frac{\kappa_1 \kappa_2^3 \tau \sqrt{n}}{\varepsilon \sqrt{\mu_1}} \right). \quad \text{(Ours, from Theorem 2)}$$ This is a better complexity in $\varepsilon$ and $\kappa_2$, especially when $\varepsilon \le \tilde{O}\big(\kappa_2 \tau/\sqrt{n \mu_1}\big)$, even without the requirement of diminishing step sizes.
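The accuracy thresholds quoted in the nonconvex comparisons follow from directly comparing the simplified complexities. A short worked check, treating all constants as 1:

```latex
\begin{align*}
\text{vs. Lin et al. (2020):}\quad
  & \kappa_2^3\,\tau\sqrt{n}\,\varepsilon^{-3} \le \kappa_2^3\,\tau^2\,\varepsilon^{-4}
    \iff \varepsilon \le \frac{\tau}{\sqrt{n}}, \\
\text{vs. Yang et al. (2022):}\quad
  & \kappa_2^3\,\tau\sqrt{n}\,\varepsilon^{-3} \le \kappa_2^4\,\tau^2\,\varepsilon^{-4}
    \iff \varepsilon \le \frac{\kappa_2\,\tau}{\sqrt{n}}.
\end{align*}
```

In both cases the shuffling-based rate wins once the target accuracy is small relative to the noise level $\tau$ divided by $\sqrt{n}$.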

4.2. COMPARISON WITH OTHER WORKS ON STOCHASTIC WITHOUT-REPLACEMENT SETTING

One of the most relevant works to this paper is Das et al. (2022, Theorem 3). The authors obtain a convergence rate similar to ours for the two-sided PŁ objective, based on linearization of gradients, but for a different algorithm which they refer to as AGDA-RR. Their algorithm can also be thought of as epoch-wise-alternating SGDA-RR, whereas our algorithm (altSGDA-RR) can be called step-wise-alternating SGDA-RR. In epoch $k$, their algorithm (i) performs updates only on $x$ ($x_0^k, \dots, x_n^k$) while fixing $y$ to $y_0^k$, and then (ii) performs updates only on $y$ ($y_0^k, \dots, y_n^k$) while fixing $x$ to $x_0^{k+1} = x_n^k$. We believe that our step-wise algorithm is closer to practice, especially when $n$ is large. Because of the distinction between algorithms, the proof techniques are also different. Xie et al. (2021, Theorem 3) present a convergence rate of CD-MA, an extension of simSGDA to the cross-device federated learning setup, in the nonconvex-PŁ setting. Their convergence result for CD-MA also assumes mini-batch sampling by random reshuffling. As a consequence, they obtain a rate analogous to our Theorem 1 if we reduce their result to the single-machine setup. Nevertheless, our convergence bound contains a term that shrinks with the number of components or mini-batches, whereas theirs does not. For a more detailed comparison, please refer to Appendix H. There are also some works on RR-based (constrained) minimax optimization algorithms other than SGDA, but for convex-concave problems. Maheshwari et al. (2022) present OGDA-RR, a gradient-free RR-based optimistic GDA algorithm. Yu et al. (2022) study stochastic proximal point with RR, consisting of double-loop epochs. Their analyses exploit convex-concavity and Lipschitz continuity of their objective, based on the arguments by Nagaraj et al. (2019). This enables a direct usage of the duality gap, the difference between the primal function $\Phi(\cdot)$ and the dual function $\Psi(\cdot) = \min_x f(x; \cdot)$, as a criterion for optimality.
On the contrary, our work relies on a different structure of the functions, which in turn differentiates the constructions of convergence rates.

4.3. COMPARISON WITH DETERMINISTIC SETTING

Here, we compare our rates with (full-batch) gradient descent-ascent (GDA): $$x_k = x_{k-1} - \alpha \nabla_1 f(x_{k-1}; y_{k-1}), \qquad y_k = y_{k-1} + \beta \nabla_2 f(\tilde{x}; y_{k-1}), \quad \text{where } \tilde{x} = \begin{cases} x_{k-1}, & \text{(simGDA)}, \\ x_k, & \text{(altGDA)}. \end{cases}$$ It uses the whole information of the objective $f$ at every iteration without any noise. For comparison with GDA, we utilize our extended theorems for arbitrary mini-batch size $b$ (Theorems 4 and 5 in Appendix A). By letting $b = n$ and matching our iterate $z_0^k = (x_0^k; y_0^k)$ to a GDA iterate $z_k = (x_k; y_k)$, our results reduce to upper convergence bounds for simGDA and altGDA.
For nonconvex-PŁ problems (Theorems 1 & 4), the convergence rate and iteration complexity (i.e., sufficient number of iterations $K_\varepsilon$) become $$\min_{k \in [K]} \|\nabla \Phi(x_k)\|^2 \le O\left( \frac{\kappa_2^2 L V_\lambda(z_1)}{K} \right); \quad \text{i.e.,} \quad K_\varepsilon = O\left( \frac{\kappa_2^2 L V_\lambda(z_1)}{\varepsilon^2} \right), \tag{3}$$ when $r = \Theta(\kappa_2^2)$. This is similar to a known rate of simGDA with $r = \Theta(\kappa_2^2)$ for nonconvex-strongly-concave problems by Lin et al. (2020, Theorem 4.4) as a special case. Their iteration complexity is written as $O((\kappa_2^2 L \Delta_\Phi + \kappa_2 L^2 D^2)/\varepsilon^2)$, where the symbols are already defined in Section 4.1. To see how the two bounds compare in terms of the factors other than $\varepsilon$, notice that we have $\Phi(x) - f(x; y) \le \frac{L}{2} \|y - y^*(x)\|^2$ for any $(x; y)$, due to the $L$-smoothness of $-f$. Here, $y^*(x)$ is an element of $\mathcal{Y}^*_x = \arg\max_y f(x; y)$. Thus, we have $V_\lambda(z_1) = \lambda[\Phi(x_1) - \Phi^*] + [\Phi(x_1) - f(z_1)] \le \lambda \Delta_\Phi + L D^2/2$. As a result, we could loosely translate our iteration complexity (3) to $O((\kappa_2^2 L \Delta_\Phi + \kappa_2^2 L^2 D^2)/\varepsilon^2)$. We suspect that the discrepancy in terms of $\kappa_2$ comes from the fact that our analysis does not require (strong) concavity in terms of $y$ or a bounded constraint set $\mathcal{Y}$: these requirements made a considerable difference in proofs.
For PŁ(Φ)-PŁ problems (Theorems 2 & 5), the rate and iteration complexity ($K_\varepsilon$) become $$V_\lambda(z_{K+1}) \le V_\lambda(z_1) \cdot \exp\left( -\frac{K}{C \kappa_1 \kappa_2^2} \right); \quad \text{i.e.,} \quad K_\varepsilon = O\left( \kappa_1 \kappa_2^2 \log(1/\varepsilon) \right), \tag{4}$$ where $r = \Theta(\kappa_2^2)$ and $C$ is a numerical constant.
This recovers the linear convergence by Yang et al. (2020, Theorem 3.2) as a special case, where they prove convergence of altGDA with step size ratio $r = \Theta(\kappa_2^2)$ for two-sided PŁ problems. Following the proof of (Yang et al., 2020, Theorem 3.2), one can show that the bound (4) indeed implies the actual convergence to a global minimax point $z^*$, in the sense that we can achieve $\|z_k - z^*\| \le \varepsilon$ in $O(\kappa_1 \kappa_2^2 \log(1/\varepsilon))$ iterations.

5. LOWER BOUND FOR (FULL-BATCH) SIMGDA USING SEPARATE STEP SIZES

As an extension of the discussion from Section 4.3, we characterize a lower complexity bound of deterministic simGDA with separate step sizes $(\alpha, \beta)$ of arbitrary ratio $r = \beta/\alpha$, for smooth strongly-convex-strongly-concave (SC-SC) cases. Surprisingly, at least for $r \gtrsim \kappa_2^2$, our lower bound matches the upper complexity bound of GDA for a much wider class of smooth PŁ(Φ)-PŁ problems. For smooth PŁ(Φ)-PŁ problems, simGDA with at least $r = \Omega(\kappa_2^2)$ has an upper complexity bound $K = O(\kappa_1 r \log(1/\varepsilon))$ for a global $\varepsilon$-convergence $V_\lambda(z_K) \le \varepsilon^2$ in terms of the potential function. This means that the lowest complexity is $O(\kappa_1 \kappa_2^2 \log(1/\varepsilon))$, achieved when $r = \Theta(\kappa_2^2)$. On the other hand, for an $L$-smooth $\mu$-SC-SC problem with saddle point $z^*$, it is well-known that simGDA with a single step size ($\alpha = \beta$) has a tight upper/lower complexity $K = \Theta(\kappa^2 \log(1/\varepsilon))$ to achieve $\|z_K - z^*\|^2 \le \varepsilon^2$, where $\kappa = L/\mu$ (e.g., Das et al. (2022, Theorem C.1)). The difference of complexity bounds in condition number ($\kappa_1 \kappa_2^2$ vs. $\kappa^2$) is somewhat questionable because, at least in smooth minimization problems, strongly convex problems and PŁ problems have identical gradient descent (GD) iteration complexity $O(\kappa \log(1/\varepsilon))$ (Karimi et al., 2016, Theorem 1). One could ask where the discrepancy in terms of $\kappa$ comes from: is it due to (i) the criteria ($V_\lambda(z_K)$ vs. $\|z_K - z^*\|^2$) for $\varepsilon$-accuracy, (ii) the function classes (PŁ(Φ)-PŁ vs. SC-SC), or (iii) the step size ratios ($\Omega(\kappa_2^2)$ vs. 1)? We answer the question by showing the following theorem: the discrepancy in $\kappa$ comes from the step size ratio difference. We defer the proof to Appendix E.
Theorem 3 (Lower bound). Consider a class $\mathcal{F}(L, \mu_1, \mu_2)$ of functions $f(x; y)$ with two arguments $x$ and $y$, which are $L$-smooth, $\mu_1$-strongly-convex in $x$, and $\mu_2$-strongly-concave in $y$. Suppose $\kappa_1 = L/\mu_1 \ge c$ and $\kappa_2 = L/\mu_2 \ge c$ for some constant $c > 1$. Then, for any step size ratio $r = \beta/\alpha > 0$, there exists a function $f \in \mathcal{F}(L, \mu_1, \mu_2)$ with a unique saddle point $z^*$, for which simGDA with any step sizes $(\alpha, \beta) = (\beta/r, \beta)$ requires at least $$K = \begin{cases} \Omega(\kappa_1 r \log(1/\varepsilon)), & \text{if } r \ge \kappa_2/c, \\ \Omega(\kappa_1 \kappa_2 \log(1/\varepsilon)), & \text{if } c/\kappa_1 \le r \le \kappa_2/c, \\ \Omega((\kappa_2/r) \log(1/\varepsilon)), & \text{if } 0 < r \le c/\kappa_1 \end{cases}$$ iterations to achieve either $\|z_K - z^*\|^2 \le \varepsilon^2$ or $V_\lambda(z_K) \le \varepsilon^2$.
Thanks to Theorem 3, we can say from Theorem 5 that for any step size ratio $r \gtrsim \kappa_2^2$, we have a tight upper bound on the iteration complexity $K = O(\kappa_1 r \log(1/\varepsilon))$ of simGDA for general PŁ(Φ)-PŁ problems. Note that Theorem 3 also subsumes the existing lower bound of the equal-step-size ($r = 1$) simGDA for $\mu$-SC-SC problems.
Given the tightness of bounds for $r \gtrsim \kappa_2^2$, a natural next step is to discuss $1 \lesssim r \lesssim \kappa_2^2$. Recent work by Li et al. (2022) also discusses the step size ratio of simGDA. In Li et al. (2022, Theorem 4.1), the authors construct a $y$-side strongly-concave function and show that simGDA with a step size ratio $r \le \kappa_2$ cannot converge. The necessity of $r \gtrsim \kappa_2$ implied by this theorem also applies to the PŁ(Φ)-PŁ case. Thus, there is no hope for showing an upper convergence bound of simGDA with $1 \lesssim r \lesssim \kappa_2$ for general nonconvex-PŁ problems. We remark that their theorem neither contradicts nor subsumes Theorem 3 because we consider a much smaller function class (SC-SC) to construct the lower bounds. On the sufficiency of $r \gtrsim \kappa_2$ for convergence, Li et al. (2022, Theorem 4.2) show that simGDA with $r \ge c\kappa_2$ (for some $c > 1$) can locally converge at the iteration complexity $O(\kappa_1 r \log(1/\varepsilon))$ for some nonconvex-strongly-concave problems, which matches the bound in Theorem 3. Our upper bounds (Theorems 4 and 5) do require $r \gtrsim \kappa_2^2$, which may look suboptimal, but we claim that our results are not necessarily weaker. One reason is that our convergence guarantee is global, i.e., independent of the initialization.
Another reason is that their analysis is only valid when a differential Stackelberg equilibrium exists, whereas a general PŁ(Φ)-PŁ function may not have such an equilibrium (for an example, see Proposition 13 in Appendix B). As far as we know, it is still an open problem whether a global convergence bound for simGDA on nonconvex-PŁ problems can be shown when the step size ratio $r$ is between $\Omega(\kappa_2)$ and $O(\kappa_2^2)$.
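The intuition behind the κ₁r scaling in the large-ratio regime can be seen on a very simple decoupled instance. The sketch below is an illustration only and is not the construction used in the proof of Theorem 3: it assumes f(x; y) = (µ₁/2)x² − (L/2)y², fixes β = 1/L (an assumption reflecting that the curvature L in y caps the usable β), and sets α = β/r, so that x contracts by a factor 1 − 1/(κ₁r) per iteration.

```python
def sim_gda_iterations(kappa1, r, L=1.0, eps=1e-3, max_iter=10**6):
    """Iterations of simGDA until x^2 + y^2 <= eps^2 on
    f(x; y) = (mu1/2) x^2 - (L/2) y^2 (illustrative decoupled instance)."""
    mu1 = L / kappa1
    beta = 1.0 / L            # assumed step size for y
    alpha = beta / r          # step size for x, determined by the ratio r
    x, y = 1.0, 1.0
    for k in range(max_iter + 1):
        if x * x + y * y <= eps * eps:
            return k
        grad_x = mu1 * x      # partial derivative of f in x
        grad_y = -L * y       # partial derivative of f in y
        x, y = x - alpha * grad_x, y + beta * grad_y
    return None               # did not converge within max_iter

iters_r1 = sim_gda_iterations(kappa1=10.0, r=1.0)
iters_r4 = sim_gda_iterations(kappa1=10.0, r=4.0)
```

On this instance, increasing r slows down the x-iterates roughly in proportion to r, which mirrors the first case of Theorem 3.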

6. EXPERIMENTS

To validate our main theoretical findings, here we present some numerical results. We focus on the primal-PŁ-strongly-concave (or PŁ(Φ)-SC, which is PŁ(Φ)-PŁ as well) quadratic games of the form $$\min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^d} f(x; y) = \frac{1}{2} x^\top A x + x^\top B y - \frac{1}{2} y^\top C y = \frac{1}{n} \sum_{i=1}^{n} f_i(x; y), \quad \text{where } f_i(x; y) = \frac{1}{2} x^\top A_i x + x^\top B_i y - \frac{1}{2} y^\top C_i y + u_i^\top x - v_i^\top y. \tag{5}$$ This toy example is often used to numerically evaluate minimax algorithms (Yang et al., 2020; Loizou et al., 2021; Das et al., 2022) and appears in various domains such as AUC maximization (Ying et al., 2016), policy evaluation (Du et al., 2017), and imitation learning (Cai et al., 2019).
To make the game in Equation (5) satisfy PŁ(Φ)-SC and component $L$-smoothness, we should sample the coefficient matrices and vectors carefully. First, they need to satisfy $\|A_i\|_2, \|B_i\|_2, \|C_i\|_2 \le L$ and $\sum_{i=1}^{n} u_i = \sum_{i=1}^{n} v_i = 0$. To make the primal function $\Phi$ a well-defined real-valued function, we let $C \succeq \mu I$ for an identity matrix $I$ and $\mu > 0$. Then, the primal function can be explicitly written as $\Phi(x) = \max_{y \in \mathbb{R}^d} f(x; y) = \frac{1}{2} x^\top (A + B C^{-1} B^\top) x := \frac{1}{2} x^\top M x$. We construct the matrix $M := A + B C^{-1} B^\top$ to be rank-deficient positive semi-definite. Letting the smallest nonzero eigenvalue of $M$ be $\mu$, we ensure that $\Phi$ is $\mu$-PŁ but not strongly convex. We emphasize that the objective function $f$ is not even (strongly-)convex in $x$ in general.
We compare six algorithms in total: simSGDA-RR, altSGDA-RR, AGDA-RR (as defined in Das et al. (2022)), and the with-replacement counterparts of these three algorithms. To this end, on 5 different randomly-generated quadratic games and under 2 random seeds per game (i.e., 10 runs per algorithm), we run each algorithm for the same number of epochs using constant step sizes of ratio $\beta/\alpha = c\kappa_2^2$ for some constant $c$ and $\kappa_2 = L/\mu$. We report the potential function values ($V_\lambda$, defined in Equation (2)) at every iteration. Results are presented in Figure 1: the values are normalized by dividing them by the initial value.
As we discussed in Section 4.1, we observe that the random reshuffling considerably accelerates the convergence of the algorithms. Furthermore, all three algorithms with random reshuffling show more or less the same performance. Specifically, the plots for simSGDA (resp. simSGDA-RR) and altSGDA (resp. altSGDA-RR) are almost identical. We believe this is because we choose a random seed for each of the 10 different runs and share it across different algorithms. Please refer to Appendix G for more detailed construction, discussion, and comparative study of the experimental results.
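One way to obtain coefficient matrices with the properties described in this section is sketched below. This is a hypothetical construction for the full objective only (it does not generate the components f_i or enforce the norm bounds by L), and it assumes the simplest choice C = µI; the paper's actual sampling procedure is the one described in Appendix G.

```python
import numpy as np

def make_quadratic_game(d=4, rank=2, mu=0.5, seed=0):
    """Return (A, B, C, M) with M = A + B C^{-1} B^T PSD and rank-deficient."""
    rng = np.random.default_rng(seed)
    # Random orthogonal basis for the eigenvectors of M.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    # Eigenvalues of M: `rank` values in [mu, 1], the rest exactly zero.
    eigs = np.zeros(d)
    eigs[:rank] = np.linspace(mu, 1.0, rank)
    M = (Q * eigs) @ Q.T                 # Q diag(eigs) Q^T
    B = rng.standard_normal((d, d))
    B /= np.linalg.norm(B, 2)            # normalize the spectral norm to 1
    C = mu * np.eye(d)                   # assumed: C = mu * I (strong concavity in y)
    A = M - B @ np.linalg.inv(C) @ B.T   # solve M = A + B C^{-1} B^T for A
    A = 0.5 * (A + A.T)                  # symmetrize against round-off
    return A, B, C, M

A, B, C, M = make_quadratic_game()
eig_M = np.linalg.eigvalsh(M)            # ascending eigenvalues of M
eig_A = np.linalg.eigvalsh(A)
```

Because M has a nontrivial null space while B C⁻¹ Bᵀ is generically nonzero on it, the resulting A has negative eigenvalues, so f is nonconvex in x even though Φ is PŁ.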

7. CONCLUSION

We investigated stochastic algorithms based on without-replacement component sampling, called simSGDA-RR and altSGDA-RR, for solving smooth nonconvex finite-sum minimax optimization problems. We established convergence rates under the y-side PŁ condition (nonconvex-PŁ) and, additionally, the primal PŁ condition (PŁ(Φ)-PŁ). We showed that SGDA-RR can achieve a faster rate than its with-replacement counterpart, which agrees with the existing theory on without-replacement SGD for minimization. Lastly, we provided complexity lower bounds for simGDA with an arbitrarily fixed step size ratio $r$, demonstrating that the full-batch upper bound with $r \gtrsim \kappa_2^2$ for PŁ(Φ)-PŁ functions is tight. Possible future directions include extending our results beyond sim/altSGDA (e.g., extra-gradient or optimistic GDA) and beyond RR (e.g., single/adversarial shuffling). As also discussed in Section 5, an interesting question remains open: can we identify tight convergence rates for stochastic (with-/without-replacement) and/or deterministic GDA with a step size ratio $r$ between the order of $\kappa_2$ and the order of $\kappa_2^2$, for general nonconvex-PŁ problems?

A MINI-BATCH SGDA-RR AND CONVERGENCE RATES

In this appendix, we present an algorithm that extends simSGDA-RR and altSGDA-RR by using mini-batches of size b ≥ 1. For simplicity, we assume that the number of components n is an integer multiple of the mini-batch size b in our analysis; i.e., n = bq for some integer q ≥ 1. One can extend this to the case when n is not necessarily a multiple of b (e.g., n = b(q -1) + s, where q ≥ 1, s ∈ [b]) so that there are q -1 mini-batches of size b and one more mini-batch of size s ≤ b. Algorithm 2 Mini-batch simSGDA/altSGDA-RR 1: Given: The number of components n = b(q -1) + s (q: number of iterations per epoch); mini-batch size b; the number of epochs K; step sizes α, β > 0 2: Initialize: (x 1 0 ; y 1 0 ) ∈ R dx × R dy 3: for k ∈ [K] do 4: Sample σ k ∼ Unif(S n ) RR: uniformly randomly shuffle the indices every epoch 5: for t ∈ [q] do 6: B k t := {σ k (j) : b(t -1) < j ≤ bt, j ∈ [n]} Mini-batch : a set of component indices 7: x k t = x k t-1 -α b i∈B k t ∇ 1 f i (x k t-1 ; y k t-1 ) 8: if simSGDA-RR then 9: y k t = y k t-1 + β b i∈B k t ∇ 2 f i (x k t-1 ; y k t-1 ) simultaneous update: x & y 10: else if altSGDA-RR then 11: y k t = y k t-1 + β b i∈B k t ∇ 2 f i (x k t ; y k t-1 ) alternating update: x → y 12: (x k+1 0 ; y k+1 0 ) = (x k n/b ; y k n/b ) Next, we illustrate the generalized versions of our main results (Theorems 1 and 2) for Algorithm 2 with mini-batches of size b ≥ 1. Let us assume n ≥ 2 because the case n = 1 trivially boils down to simGDA or altGDA. We defer the proofs for simultaneous updates to Appendix C. We present the parts that change in the proof for alternating updates in Appendix D. Theorem 4 (Nonconvex-PŁ, mini-batch SGDA-RR). Suppose f satisfies Assumptions 1, 2, 3, and 4. Let λ = 4. Choose the step sizes α and β by α = β/r for some r ≥ 14κ 2 2 and β = b • min    1 6Ln 1 + n-b n-1 • A n , 1 c V λ (z 1 0 ) Ln 2 ( n-b n-1 )BK 1 3    , for some numerical constant c > 0. 
Then, mini-batch simSGDA-RR and altSGDA-RR with minibatch size b (a divisor of n) satisfy 1 K K k=1 E ∇Φ(x k 0 ) 2 ≤ 6rLV λ (z 1 0 ) K 1 + n -b n -1 A n + 2cr L 2 B V λ (z 1 0 ) 2 nK 2 • n -b n -1 1/3 . Theorem 5 (PŁ(Φ)-PŁ, mini-batch SGDA-RR). Suppose f satisfies Assumptions 1, 2, 3, 4, and 5. Let λ = 4. Choose the step sizes α and β by α = β/r for some r ≥ 14κ 2 2 and β = b • min    1 6Ln 1 + n-b n-1 • A n , 2r µ 1 nK max    1, log   V λ (z 1 0 )µ 1 nK 2 8c 3 κ 2 1 r 3 n-b n-1 B         , for some numerical constant c > 0. Then, mini-batch simSGDA-RR and altSGDA-RR with minibatch size b (a divisor of n) satisfy E[V λ (z K+1 0 )] ≤ O   V λ (z 1 0 ) • exp   - K 12κ 1 r 1 + n-b n-1 A n     + Õ κ 2 1 r 3 B µ 1 nK 2 • n -b n -1 . As a side remark, some works consider a sampling method called b-minibatch sampling where all the elements in each mini-batch are distinct (i.e., without-replacement component sampling per mini-batch), e.g., Loizou et al. (2021, Definition 2.1) . However, there is a significant gap between this method and ours: any two distinct mini-batches sampled by the b-minibatch sampling can intersect with each other (i.e., mini-batches are sampled with replacement), whereas, in each epoch of our Algorithm 2, all the mini-batches are mutually disjoint. Proposition 6 (κ ≥ 1). Let g be an L-smooth function which is bounded below by g * . Then, for any x, ∇g(x) 2 ≤ 2L [g(x) -g * ] . If g is µ-PŁ as well, then µ ≤ L. Consequently, the condition number κ := L/µ of g is ≥ 1. Proof. Since g is L-smooth, for any x and y, g * ≤ g(y) ≤ g(x) + ∇g(x), y -x + L 2 y -x 2 . ( ) Now define a convex quadratic function h x (y) of y as h x (y) := g(x) + ∇g(x), y -x + L 2 y -x 2 . Since its gradient is ∇h x (y) = ∇g(x) + L(y -x), y * := x -1 L ∇g(x) is a minimum of h x . Plugging y = y * to the equation ( 6), we get g * ≤ g(x) + ∇g(x), - 1 L ∇g(x) + L 2 - 1 L ∇g(x) 2 = g(x) - 1 2L ∇g(x) 2 . Rearranging the terms, ∇g(x) 2 ≤ 2L [g(x) -g * ] . 
If we additionally utilize PŁ inequality with g * := min g(x), ∇g(x) 2 ≥ 2µ [g(x) -g * ] , we directly yield µ ≤ L and thus κ = L/µ ≥ 1. Definition 1 (Karimi et al. (2016) ). Consider g : X → R. Let x p ∈ Π X * (x) be a projection of x onto the optimal set X * = arg min x∈X g(x). (1) We say g satisfies µ-strong convexity (SC ) if g(x ) ≥ g(x) + ∇g(x), x -x + µ 2 x -x 2 for any x, x ∈ X . (2) We say g satisfies µ-restricted secant inequality (RSI) if ∇g(x), xx p ≥ µ x p -x 2 for any x ∈ X . (3) We say g satisfies µ-error bound (EB) condition if ∇g(x) ≥ µ x p -x for any x ∈ X . (4) We say g satisfies µ-quadratic growth (QG) condition if g(x)-min x g(x ) ≥ µ 2 x p -x 2 for any x ∈ X . Proposition 7. From Definition 1, The following implications are true. • µ-SC implies µ-PŁ and µ-RSI. • µ-PŁ implies µ-QG and µ-EB. • µ-RSI implies µ-EB. • µ-EB and L-smoothness together imply (µ 2 /L)-PŁ. Proof. Most of the proofs originated from Karimi et al. (2016, Theorem 2) . (SC ⇒ PŁ) Substitute x to x p and x to x, respectively, from Definition 1.( 1). (PŁ ⇒ QG & EB) See the proof in Karimi et al. (2016, Theorem 2)  (SC ⇒ RSI) We know µ-SC ⇒ µ-PŁ ⇒ µ-QG. From Definition 1.(1) & 1.(4), ∇g(x), x -x p SC ≥ g(x) -g(x p ) + µ 2 x p -x 2 QG ≥ µ 2 x p -x 2 + µ 2 x p -x 2 = µ x p -x 2 . This implies µ-RSI. (RSI ⇒ EB) See the proof in Karimi et al. (2016, Theorem 2) . (EB & smooth ⇒ PŁ) We use ∇g(x p ) = 0. By L-smoothness and µ-EB condition, g(x) -g(x p ) smooth ≤ ∇g(x p ), x -x p + L 2 x -x p 2 = L 2 x -x p 2 EB ≤ L 2µ 2 ∇g(x) 2 . This implies (µ 2 /L)-PŁ condition on g. Proposition 8 (Lipschitz continuity-like property of y * (x)). For an L-smooth function g : X ×Y → R, suppose -g(x; •) is µ 2 -PŁ. Let κ 2 = L/µ 2 . Consider any x 0 , x 1 ∈ X . For any y * 0 ∈ Y * x0 = arg max y∈Y g(x 0 ; y), there exists a y * 1 ∈ Y * x1 = arg max y∈Y g(x 1 ; y) such that y * 0 -y * 1 ≤ κ 2 x 0 -x 1 . 
In fact, it is enough to choose y * 1 as a projection of y * 0 onto the set Y * x1 , namely, y * 1 ∈ Π Y * x 1 (y * 0 ). Proof. We borrow the proof from Nouiehed et al. (2019, Lemma A.3) . Recall Φ(x) := max y ∈Y g(x; y ). By PŁ inequality and smoothness of g, 2µ 2 (Φ(x 1 ) -g(x 1 ; y * 0 )) ≤ ∇ 2 g(x 1 ; y * 0 ) 2 = ∇ 2 g(x 1 ; y * 0 ) -∇ 2 g(x 0 ; y * 0 ) 2 ≤ L 2 x 1 -x 0 2 . The second equality applies ∇ 2 g(x 0 ; y * 0 ) = 0, since y * 0 ∈ arg max y g(x 0 ; y). Moreover, note that -g(x 1 ; •) satisfies µ 2 -QG condition (∵ Proposition 7). To apply this, we utilize our choice of y * 1 : Φ(x 1 ) -g(x 1 ; y * 0 ) ≥ µ 2 2 y * 1 -y * 0 2 . As a result, we have µ 2 2 y * 0 -y * 1 2 ≤ L 2 x 0 -x 1 2 . This completes the proof. Proposition 9 (Smoothness of primal function). Consider the same function g as Proposition 8. Then, the function Φ(x) := max y ∈Y g(x; y ) is differentiable with ∇Φ(x) = ∇ 1 g(x; y * (x)), regardless of the choice of y * (x) ∈ arg max y ∈Y g(x; y ). Moreover, Φ is L(κ 2 + 1)-smooth, where κ 2 = L/µ 2 . Proof. This is already proved in Lemma A.5 of Nouiehed et al. (2019) . However, we present a bit different proof without using second-order Taylor expansion. To start, recall Y * x := arg max y∈Y g(x; y). That is, we could choose any y * (x) ∈ Y * x . We first show the differentiability of Φ. Fix a unit vector u ∈ X = R dx : u = 1. Let any h > 0. We first claim that there exists a path p : (-h, h] → Y = R dy which is continuous at t = 0 and p(t) ∈ Y * (x+tu) . In fact, let p(t) be a projection of y * (x) (that we chose) onto the set Y * (x+tu) . Then, p(0) = y * (x), and by Proposition 8, we have p(0 ) -p(t) ≤ κ 2 x -(x + tu) = κ 2 t. This shows the continuity of p(t) at t = 0. Now, note that there exists a t 1 ∈ (0, h) such that, Φ(x + hu) -Φ(x) = g(x + hu; p(h)) -g(x; p(0)) = g(x + hu; p(h)) -g(x + hu; p(0)) + g(x + hu; p(0)) -g(x; p(0)) ≥ 0 + ∇ 1 g(x + t 1 u; p(0)), hu , by mean value theorem (applied to the first argument). 
We have the inequality at the last line because g(x + hu; p(h)) ≥ g(x + hu; p(0)), since p(h) ∈ Y * (x+hu) . With a similar logic, there exists a t 2 ∈ (0, h) such that, Φ(x + hu) -Φ(x) = g(x + hu; p(h)) -g(x; p(0)) = g(x + hu; p(h)) -g(x; p(h)) + g(x; p(h)) -g(x; p(0)) ≤ ∇ 1 g(x + t 2 u; p(h)), hu + 0. To combine these two inequalities into a single line, ∇ 1 g(x + t 1 u; p(0)), u ≤ Φ(x + hu) -Φ(x) h ≤ ∇ 1 g(x + t 2 u; p(h)), u . Using the continuity of p(•) and ∇ 1 g(•; •) (∵ g has Lipschitz continuous gradient), we can deduce that the directional derivative of Φ in a direction u (denoted by D u Φ) is in fact D u Φ(x) = ∇ 1 g(x; y * (x)), u , by taking the limit h → 0+. Since u is arbitrary, we can conclude that ∇Φ(x) = ∇ 1 g(x; y * (x)). The proof of Lipschitz smoothness of Φ exactly follows the proof by Nouiehed et al. (2019) . Consider any x 0 , x 1 ∈ X . As in Proposition 8, choose any y * 0 ∈ Y * x0 and y * 1 ∈ Π Y * x 1 (y * 0 ). Then, ∇Φ(x 0 ) -∇Φ(x 1 ) = ∇ 1 g(x 0 ; y * 0 ) -∇ 1 g(x 1 ; y * 1 ) ≤ ∇ 1 g(x 0 ; y * 0 ) -∇ 1 g(x 1 ; y * 0 ) + ∇ 1 g(x 1 ; y * 0 ) -∇ 1 g(x 1 ; y * 1 ) ≤ L { x 0 -x 1 + y * 0 -y * 1 } ≤ L(1 + κ 2 ) x 0 -x 1 . The last inequality holds because of Proposition 8. Proposition 10 (x-side PŁ ⇒ primal PŁ). Suppose g : X × Y → R is L-smooth and twosided PŁ with constants µ 1 and µ 2 . Then, g satisfies primal PŁ condition: the function Φ(x) := max y ∈Y g(x; y ) is µ 1 -PŁ. As a result, a smooth two-sided PŁ function is PŁ(Φ)-PŁ. Proof. See Lemma A.3 of Yang et al. (2020) . Definition 2. Consider g : X × Y → R. Then, the point (x * ; y * ) ∈ X × Y is called (i) a stationary point of g if ∇ 1 g(x * ; y * ) = ∇ 2 g(x * ; y * ) = 0. (ii) a saddle point of g if g(x * ; y) ≤ g(x * ; y * ) ≤ g(x; y * ) for all x, y. (iii) a global minimax point of g if g(x * ; y) ≤ g(x * ; y * ) ≤ max y g(x; y ) for all x, y. (iv) a global maximin point of g if min x g(x ; y) ≤ g(x * ; y * ) ≤ g(x; y * ) for all x, y. Proposition 11. 
Consider a function $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$.
(1) In general, a saddle point of $g$ is a global minimax/maximin point.
(2) Let $\Phi(x) := \max_y g(x; y)$ and $\Phi^* := \min_x \Phi(x)$ be well-defined. Let $\lambda > 0$ be a constant. In general, a point $(x^*; y^*)$ is a global minimax point of $g$ if and only if $V_\lambda(x^*; y^*) := \lambda[\Phi(x^*) - \Phi^*] + [\Phi(x^*) - g(x^*; y^*)] = 0$.
(3) If $g$ is smooth nonconvex-PŁ, then a global minimax point is a stationary point.
(4) If $g$ is PŁ(Φ)-PŁ, then there exists a global minimax point $(x^*; y^*)$ of $g$. As a result, if $g$ is also smooth, then the point $(x^*; y^*)$ is a stationary point.
(5) If $g$ is smooth two-sided PŁ, every stationary point is a saddle point. As a result, there exists a saddle point $(x^*; y^*)$ of $g$. In particular, smooth two-sided PŁ functions enjoy the "minimax theorem," which establishes "minimax = maximin."

Proof. (2) (global minimax ⇐⇒ $V_\lambda = 0$) The terms $\Phi(x) - \Phi^*$ and $\Phi(x) - g(x; y)$ are non-negative for any $(x; y)$. Hence, $V_\lambda(x; y)$ is non-negative, and $V_\lambda(x^*; y^*) = 0$ if and only if $\Phi^* = \Phi(x^*) = g(x^*; y^*)$, which is equivalent to the global minimax point condition.
(3) (smooth nonconvex-PŁ: global minimax ⇒ stationary) Suppose $(x^*; y^*)$ is a global minimax point. Since $g(x^*; y) \le g(x^*; y^*)$ for any $y$, we have $\Phi(x^*) = \max_y g(x^*; y) = g(x^*; y^*)$. Moreover, $g(x^*; y^*) \le \max_{y'} g(x; y') = \Phi(x)$ for any $x$. Thus, $\Phi$ attains its minimum $g(x^*; y^*)$ at $x = x^*$. By Proposition 9, $\Phi(\cdot)$ is a differentiable function and we have $\nabla_1 g(x^*; y^*) = \nabla\Phi(x^*) = 0$. Also, since the differentiable function $g(x^*; \cdot)$ has a maximum at $y = y^*$, we also have $\nabla_2 g(x^*; y^*) = 0$. Therefore, $(x^*; y^*)$ is a stationary point.
(4) (PŁ(Φ)-PŁ: ∃ global minimax) Let $x^* \in \arg\min_x \Phi(x)$ and $y^* \in \arg\max_y g(x^*; y)$. Then, $g(x^*; y^*) = \Phi(x^*) = \Phi^*$. As noted in (2), $(x^*; y^*)$ is a global minimax point. By (3), it is in fact a stationary point when $g$ is smooth as well.
(5) (smooth two-sided PŁ: stationary ⇒ saddle) Let $(x^*; y^*)$ be a stationary point.
By PŁ inequalities, for any x and y, 0 = ∇ 2 g(x * ; y * ) 2 ≥ 2µ 2 (max y g(x * ; y) -g(x * ; y * )) ≥ 0, 0 = ∇ 1 g(x * ; y * ) 2 ≥ 2µ 1 (g(x * ; y * ) -min x g(x; y * )) ≥ 0. Since µ 1 , µ 2 > 0, these imply max y g(x * ; y) = g(x * ; y * ) = min x g(x; y * ). Thus, (x * ; y * ) is a saddle point. Note that (4) and Proposition 10 together proves the existence of a stationary point of g. Therefore, there must exists a saddle point, which is also pointed out by Guo et al. (2020, Lemma 8) . This concludes the proof. We remark that, in the proof above, ( 3) is false for general (nonconvex-nonconcave) functions. Only local minimax point can ensure stationarity (Jin et al., 2020) . As remarked by Jin et al. (2020) (Figure 2 of their paper), the function xy-cos(y) has non-stationary global minimax points (0, ±π). The following two propositions are for showing that general two-sided PŁ function may not have a differential Stackelberg equilibrium defined as Li et al. (2022, Definition 3.1) . Proposition 12. Let g be a µ-strongly convex function on R n . Consider any matrix M ∈ R n×m with a positive rank. Suppose that θ is the smallest nonzero singular value of M . Then g(M y) is a µθ 2 -PŁ function of y ∈ R m . Proof. See Karimi et al. (2016, Appendix B) for the proof. Proposition 13. Consider a twice continuously differentiable strongly-convex-strongly-concave function h : R r × R s → R. That is, for some constants µ 1 , µ 2 > 0, h(x; y) is µ 1 -strongly-convex in x and -h(x; y) is µ 2 -strongly-convex in y. Let (x * ; y * ) be the unique stationary point of h. Of course, it is a differential Stackelberg equilibrium of h. That is, if the hessian matrix ∇ 2 h(x * ; y * ) at that point is written as ∇ 2 h(x * ; y * ) = ∇ 2 1,1 h(x * ; y * ) ∇ 2 1,2 h(x * ; y * ) ∇ 2 2,1 h(x * ; y * ) ∇ 2 2,2 h(x * ; y * ) = C B B -A , then A and C-BA -1 B are both positive definite matrices. 
Consider a function g : R p ×R q → R defined by g(x; y) = h(M x; N y) for some matrices M ∈ R r×p , N ∈ R s×q . Then, g is twosided PŁ. Moreover, each stationary point of g may not be a differential Stackelberg equilibrium in general, for example, when s < q. Proof. Because of Proposition 12, g is clearly a two-sided PŁ function. If (x; y) is a stationary point of g, then it must be an element of an affine set {(x; y) ∈ R p × R q : M x = x * ; N y = y * }. This is because ∇g(x; y) = ∇ 1 g(x; y) ∇ 2 g(x; y) = M ∇ 1 h(M x; N y) N ∇ 2 h(M x; N y) = 0 if and only if ∇ 1 h(M x; N y) = 0 and ∇ 2 h(M x; N y) = 0, being equivalent to M x = x * and N y = y * . Furthermore, the hessian of g at (x; y) is ∇ 2 g(x; y) = M ∇ 2 1,1 h(M x; N y)M M ∇ 2 1,2 h(M x; N y)N N ∇ 2 2,1 h(M x; N y)M N ∇ 2 2,2 h(M x; N y)N = M CM M BN (M BN ) -N AN . If s < q, the q × q matrix N AN cannot have a full rank, thereby it cannot be even invertible. This implies the stationary point (x; y) cannot be a differential Stackelberg equilibrium.
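The degeneracy in Proposition 13 can be seen on a very small instance. The sketch below assumes the hypothetical choice $h(u; v) = u^2/2 + uv - v^2/2$ (so that, in the notation above, $A = B = C = 1$ and $r = s = 1$), with $M = [1]$ and $N = [1\ \ 1]$, hence $s = 1 < q = 2$. The function names are illustrative only.

```python
def g(x, y1, y2):
    """g(x; y) = h(Mx; Ny) with h(u; v) = u^2/2 + u*v - v^2/2, M = [1], N = [1, 1]."""
    v = y1 + y2
    return 0.5 * x * x + x * v - 0.5 * v * v


def grad_g(x, y1, y2):
    """Gradient of g by the chain rule: (M^T dh/du, N^T dh/dv)."""
    v = y1 + y2
    dh_du = x + v
    dh_dv = x - v
    return dh_du, dh_dv, dh_dv


def hessian_yy():
    """Second-derivative block of g in y: -N^T A N with A = 1."""
    return [[-1.0, -1.0], [-1.0, -1.0]]


def det2(mat):
    """Determinant of a 2x2 matrix given as nested lists."""
    return mat[0][0] * mat[1][1] - mat[0][1] * mat[1][0]
```

Every point $(0; s, -s)$ is stationary for this $g$, while the $y$-block of the Hessian is singular, so none of these stationary points can be a differential Stackelberg equilibrium.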

B.2 WITHOUT-REPLACEMENT SAMPLING

In this subsection, we provide a useful proposition for analyzing mini-batching under without-replacement sampling. We consider the case where the mini-batches are mutually disjoint throughout a whole epoch, rather than only applying without-replacement sampling within each individual mini-batch. Consider a collection of $n$ vectors $v_1, \dots, v_n \in \mathbb{R}^d$. Suppose we sample a permutation $\sigma : [n] \to [n]$ uniformly at random; i.e., $\sigma \sim \mathrm{Unif}(S_n)$. Define $m = \frac{1}{n}\sum_{i=1}^n v_i$ (sample mean) and $\tau^2 = \frac{1}{n}\sum_{i=1}^n \|v_i - m\|^2$ (sample variance). Fix any $b \in [n]$ and let $n = b(q-1) + s$ for some integers $q \ge 1$ and $s \in [b]$. Now, divide the indices $[n]$ into $q$ batches, with exactly $b$ items per batch (except for the last batch when $s < b$), as follows:

$$W_t = \{\sigma(j) : b(t-1) < j \le bt,\ j \in [n]\} \quad (t \in [q]).$$

For each batch $W_t$, define $w_t = \frac{1}{|W_t|}\sum_{i \in W_t} v_i$ (batch mean). For any $k \in [q-1]$, define $m_k := \frac{1}{k}\sum_{t=1}^k w_t$ (cumulative average of batch means over $1 \le t \le k$). For $k = q$, we simply take $m_q = m$ (deterministically). Using the randomness of $\sigma$, we can obtain the mean (vector) and the variance (scalar) of $m_k$ as follows.

Proposition 14 (Without-replacement sampling). Given the setup above, for any $k < q$ and $n > 1$,

$$\mathbb{E}[m_k] = m \quad \text{and} \quad \mathbb{E}\|m_k - m\|^2 = \frac{n - bk}{bk(n-1)}\,\tau^2.$$

(Of course, if $k = q$ or $n = 1 = q$, $\mathbb{E}[\|m_q - m\|^2] = 0$ since $m_q = m$.)

Remark. As a special case, if $n = bq$ (namely, $b$ divides $n$ and $s = b$), then for any $k \le q$,

$$\mathbb{E}\|m_k - m\|^2 = \frac{q - k}{k(n-1)}\,\tau^2.$$

If we further assume $b = s = 1$ and $q = n$, this proposition recovers Lemma 1 of Mishchenko et al. (2020).

Proof of Proposition 14. Since $\sigma$ is a uniformly randomly sampled permutation, it is easy to obtain that $\mathbb{E}[v_{\sigma(i)}] = \mathbb{E}[w_t] = \mathbb{E}[m_k] = m$, for any $i \in [n]$, $t \in [q]$, and $k \in [q]$. The covariances between the $v_{\sigma(i)}$'s can be deduced from the proof by Mishchenko et al. (2020, Lemma 1) as follows:

$$\mathrm{Cov}(v_{\sigma(i)}, v_{\sigma(j)}) := \mathbb{E}\langle v_{\sigma(i)} - m,\ v_{\sigma(j)} - m\rangle = \begin{cases} -\dfrac{\tau^2}{n-1}, & \text{if } i \ne j, \\ \tau^2, & \text{if } i = j. \end{cases}$$

Thus, for each $t \in [q]$, the variance of $w_t$ is obtained as

$$\begin{aligned} \mathbb{E}\|w_t - m\|^2 &= \mathbb{E}\Big\|\frac{1}{|W_t|}\sum_{i \in W_t}(v_i - m)\Big\|^2 = \frac{1}{|W_t|^2}\Big\{\sum_{i \in W_t}\mathbb{E}\|v_i - m\|^2 + \sum_{\substack{i, j \in W_t \\ i \ne j}}\mathrm{Cov}(v_i, v_j)\Big\} \\ &= \frac{1}{|W_t|^2}\Big\{|W_t|\,\tau^2 + |W_t|(|W_t| - 1)\Big(-\frac{\tau^2}{n-1}\Big)\Big\} = \frac{n - |W_t|}{|W_t|(n-1)}\,\tau^2, \end{aligned}$$

which can also be directly deduced from Lemma 1 of Mishchenko et al. (2020). Notice that this depends on the batch $W_t$ only through its size $|W_t|$. Next, we look at the covariances between distinct $w_t$'s. For a pair of distinct integers $t, u \in [q]$, by the bilinearity of covariance,

$$\mathrm{Cov}(w_t, w_u) = \frac{1}{|W_t| \cdot |W_u|}\sum_{(i,j) \in W_t \times W_u}\mathrm{Cov}(v_i, v_j) = \frac{1}{|W_t| \cdot |W_u|}\sum_{(i,j) \in W_t \times W_u}\Big(-\frac{\tau^2}{n-1}\Big) = -\frac{\tau^2}{n-1}.$$

The second equality holds because $W_t$ and $W_u$ are disjoint sets of indices whenever $t \ne u$. Now, fix any $k \in [q-1]$. Note that, by our mini-batching strategy, $|W_t| = b$ for every $t < q$. Therefore, by definition of $m_k$,

$$\begin{aligned} \mathbb{E}\|m_k - m\|^2 &= \mathbb{E}\Big\|\frac{1}{k}\sum_{t=1}^k (w_t - m)\Big\|^2 = \frac{1}{k^2}\Big\{\sum_{t=1}^k \mathbb{E}\|w_t - m\|^2 + \sum_{\substack{t, u \in [k] \\ t \ne u}}\mathrm{Cov}(w_t, w_u)\Big\} \\ &= \frac{1}{k^2}\Big\{k \cdot \frac{n - b}{b(n-1)}\,\tau^2 + k(k-1)\cdot\Big(-\frac{\tau^2}{n-1}\Big)\Big\} = \frac{n - bk}{bk(n-1)}\,\tau^2. \end{aligned}$$
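The identity in Proposition 14 can be sanity-checked by brute force on a tiny example. The sketch below enumerates all permutations of a few scalar "vectors" (d = 1) and compares the exact expectation of $(m_k - m)^2$ with the closed form, using exact rational arithmetic. The function name `check_proposition_14` and the test values are hypothetical and only serve this illustration.

```python
from fractions import Fraction
from itertools import permutations


def check_proposition_14(values, b):
    """Return {k: (exact E|m_k - m|^2, (n - b k) / (b k (n - 1)) * tau^2)} for k < q."""
    n = len(values)
    v = [Fraction(x) for x in values]
    m = sum(v) / n                                # sample mean
    tau2 = sum((x - m) ** 2 for x in v) / n       # sample variance (1/n normalization)
    q = -(-n // b)                                # number of batches, ceil(n / b)
    perms = list(permutations(range(n)))
    results = {}
    for k in range(1, q):                         # the first k batches all have size b
        total = Fraction(0)
        for sigma in perms:
            batch_means = []
            for t in range(k):
                idx = sigma[b * t: b * (t + 1)]   # batch W_{t+1} under permutation sigma
                batch_means.append(sum(v[i] for i in idx) / len(idx))
            m_k = sum(batch_means) / k            # cumulative average of batch means
            total += (m_k - m) ** 2
        empirical = total / len(perms)            # exact expectation over Unif(S_n)
        predicted = Fraction(n - b * k, b * k * (n - 1)) * tau2
        results[k] = (empirical, predicted)
    return results
```

Calling `check_proposition_14([1, 4, -2, 7, 3], 2)` covers the case $n = 5$, $b = 2$, $q = 3$, $s = 1$, where the last batch is smaller than $b$; the enumeration is exhaustive, so it is only practical for very small $n$.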

B.3 BASIC RECURRENCE INEQUALITY

In this subsection, we present a basic result on a recurrence inequality. It serves as a stepping-stone for our convergence bound, particularly at the end of the proof (Appendix C.5).

Proposition 15. Let $\{a_k\}_{k=1}^{\infty}$ be a sequence of non-negative numbers satisfying the following recurrence inequality:

$$a_{k+1} \le (1 - b\eta)\,a_k + c\,\eta^{m+1},$$

where $b$, $c$, and $\eta$ are non-negative real numbers such that $b\eta \in (0, 1)$, and $m$ is a non-negative integer. Then, for any integer $K \ge 1$, we have

$$a_{K+1} \le (1 - b\eta)^K a_1 + c\,\eta^m / b.$$

Proof. We proceed by induction on $K = 0, 1, 2, \dots$. Note that $a_1 \le (1 - b\eta)^0 a_1 + c\,\eta^m / b$. This shows the case $K = 0$. On the other hand, if $K \ge 1$, by the inductive assumption,

$$a_{K+1} \le (1 - b\eta)\,a_K + c\,\eta^{m+1} \le (1 - b\eta)\left[(1 - b\eta)^{K-1} a_1 + c\,\eta^m / b\right] + c\,\eta^{m+1} = (1 - b\eta)^K a_1 + c\,\eta^m / b.$$
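A quick numerical check of Proposition 15 is sketched below. It iterates the recurrence with equality (the extreme case allowed by the inequality) in exact rational arithmetic and compares it with the closed-form bound. The helper names and the parameter values in the usage note are hypothetical.

```python
from fractions import Fraction


def worst_case_sequence(a1, b, c, eta, m, K):
    """Return [a_1, ..., a_{K+1}] with a_{k+1} = (1 - b*eta) * a_k + c * eta**(m+1)."""
    seq = [Fraction(a1)]
    for _ in range(K):
        seq.append((1 - b * eta) * seq[-1] + c * eta ** (m + 1))
    return seq


def closed_form_bound(a1, b, c, eta, m, K):
    """Right-hand side of Proposition 15: (1 - b*eta)^K * a_1 + c * eta^m / b."""
    return (1 - b * eta) ** K * a1 + c * eta ** m / b
```

For instance, with `a1 = Fraction(5)`, `b = Fraction(2)`, `c = Fraction(3)`, `eta = Fraction(1, 10)`, and `m = 2`, the entry `worst_case_sequence(...)[K]` (which is $a_{K+1}$) stays below `closed_form_bound(..., K)`, and the gap equals $(c\eta^m/b)(1 - b\eta)^K$.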

C PROOFS FOR (MINI-BATCH) SIMULTANEOUS SGDA-RR

In this appendix, we provide a convergence analysis for the mini-batch simSGDA-RR (Algorithm 2) on both general nonconvex-PŁ problems and primal-PŁ-PŁ problems. The two cases mostly share the same proof strategies; they only diverge at the end of the proofs. Since the proof is long, we first provide a proof sketch in Subsection C.1 and then give the full proof, divided into the four follow-up subsections of this appendix. The proof for the alternating counterpart (mini-batch altSGDA-RR) can be done with some modifications illustrated in Appendix D. All technical propositions required for the proofs can be found in Appendix B.

C.1 WARM-UP: PROOF SKETCH FOR b = 1

Here we consider the proofs of Theorems 1 and 2 for simSGDA-RR, which is the fully stochastic case (mini-batches of size $b = 1$). The proofs for altSGDA-RR can be done with slight modifications. We start the proof by aggregating all updates throughout an epoch to obtain an "epoch-wise" update:

$$x_0^{k+1} = x_0^k - n\alpha\, g^k, \quad g^k = \frac{1}{n}\sum_{i=1}^n \nabla_1 f_{\sigma_k(i)}(z_{i-1}^k), \qquad y_0^{k+1} = y_0^k + n\beta\, h^k, \quad h^k = \frac{1}{n}\sum_{i=1}^n \nabla_2 f_{\sigma_k(i)}(z_{i-1}^k).$$

The reason is that the sampled components in each epoch depend on each other, so it is much harder to deal with each iteration individually. The strategy of update-aggregation is quite general for the analysis of optimization algorithms involving without-replacement sampling (Ahn et al., 2020; Mishchenko et al., 2020; Nguyen et al., 2021; Das et al., 2022). We assume that the intermediate iterates $z_1^k, \dots, z_n^k$ stay close to the starting iterate $z_0^k$ of an epoch $k$, which can be ensured by small step sizes. Then, we can approximate the aggregated epoch of SGDA-RR as a step of simGDA applied to $f = \frac{1}{n}\sum_{i=1}^n f_i$, with approximations $g^k \approx \nabla_1 f(z_0^k)$ and $h^k \approx \nabla_2 f(z_0^k)$. With Assumptions 1 and 4, note that the primal function $\Phi(\cdot)$ is $(L + L^2/\mu_2)$-smooth (Proposition 9). Applying this and the $L$-smoothness of $-f$, we obtain the following inequality (Lemma 16):

$$\begin{aligned} V_\lambda(z_0^{k+1}) - V_\lambda(z_0^k) \le{}& -\frac{\lambda+1}{2}\, n\alpha \|\nabla\Phi(x_0^k)\|^2 + (\lambda+1)\, n\alpha \|\nabla\Phi(x_0^k) - \nabla_1 f(z_0^k)\|^2 \\ &+ \frac{n\alpha}{2}\|\nabla_1 f(z_0^k)\|^2 - \frac{n\beta}{2}\|\nabla_2 f(z_0^k)\|^2 \\ &+ \Big(\lambda + \frac{1}{2}\Big) n\alpha \|g^k - \nabla_1 f(z_0^k)\|^2 + \frac{n\beta}{2}\|h^k - \nabla_2 f(z_0^k)\|^2. \end{aligned}$$

Hence, to guarantee a fast decrease of $V_\lambda(z_0^k)$, it is important to control the "noise" terms of the GDA approximations, $\|g^k - \nabla_1 f(z_0^k)\|^2$ and $\|h^k - \nabla_2 f(z_0^k)\|^2$, in the last line of the inequality above. By applying the tools for without-replacement sampling (Proposition 14), we can upper-bound the conditional expectations of both noise terms by

$$2L^2 n(n + A)\left(\alpha^2 \|\nabla_1 f(z_0^k)\|^2 + \beta^2 \|\nabla_2 f(z_0^k)\|^2\right) + 2L^2 n(\alpha^2 + \beta^2)B$$

(Lemmas 17 and 18). Then, by taking advantage of several properties of smooth nonconvex-PŁ functions (e.g., Propositions 7, 8, and 9) and some small-step-size assumptions (e.g., $\beta = O(1/(nL))$, $\beta/\alpha = r \gtrsim \kappa_2^2$), we eventually have

$$\mathbb{E}\left[V_\lambda(z_0^{k+1})\right] - \mathbb{E}\left[V_\lambda(z_0^k)\right] \le -n\alpha\, \mathbb{E}\|\nabla\Phi(x_0^k)\|^2 - \frac{L\kappa_2 n\alpha}{2}\, \mathbb{E}\left[\Phi(x_0^k) - f(z_0^k)\right] + C\alpha^3,$$

where $C \ge 0$ is a constant (with respect to $k$) depending on $L$, $n$, $B$, and $r = \beta/\alpha$ (Lemma 20). We note that the step size ratio $r \gtrsim \kappa_2^2$ is crucial for showing that the coefficient in front of the term $\mathbb{E}[\Phi(x_0^k) - f(z_0^k)]$ is non-positive: even if this is possible with a ratio $r$ of smaller order than $\kappa_2^2$, we must then assume that $\kappa_2$ is upper-bounded by a positive numerical constant, which is not desirable for showing convergence bounds. Thus, we expect that a different proof strategy would be needed to avoid the requirement $r \gtrsim \kappa_2^2$ on the step size ratio. The proofs of Theorems 1 and 2 diverge from here. The rest of the proof is mostly about choosing appropriate step sizes and solving the recurrence inequalities. The full proof of Theorems 4 and 5 starts from the following subsection.
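The epoch-wise aggregation used in the sketch above is a telescoping identity, which can be checked directly. The snippet below runs one epoch of simSGDA-RR (b = 1) on three hypothetical scalar quadratic components in exact rational arithmetic and returns both the end-of-epoch iterate and the averaged gradients $g^k$, $h^k$ evaluated along the trajectory. The component coefficients and the function names are assumptions made for this illustration.

```python
from fractions import Fraction as F

# Hypothetical scalar components f_i(x, y) = a_i x^2/2 + b_i x y - c_i y^2/2,
# stored as tuples (a_i, b_i, c_i).
COMPONENTS = [(F(1), F(1, 2), F(1)), (F(2), F(-1, 3), F(3, 2)), (F(1, 2), F(1), F(2))]


def grads(i, x, y):
    """Return (grad_1 f_i, grad_2 f_i) at z = (x; y)."""
    a, b, c = COMPONENTS[i]
    return a * x + b * y, b * x - c * y


def one_epoch(x0, y0, sigma, alpha, beta):
    """Run one simSGDA-RR epoch in the order sigma; return (z_end, (g_k, h_k))."""
    x, y = x0, y0
    gx_sum, gy_sum = F(0), F(0)
    for i in sigma:
        gx, gy = grads(i, x, y)          # both gradients at the same iterate z_{i-1}
        gx_sum += gx
        gy_sum += gy
        x, y = x - alpha * gx, y + beta * gy   # simultaneous update
    n = len(sigma)
    return (x, y), (gx_sum / n, gy_sum / n)
```

With `(x1, y1), (g, h) = one_epoch(F(1), F(-2), (2, 0, 1), F(1, 10), F(1, 5))`, the output satisfies `x1 == F(1) - 3 * F(1, 10) * g` and `y1 == F(-2) + 3 * F(1, 5) * h` exactly, which is the epoch-wise update with $n = 3$.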

C.2 EPOCH-WISE REPRESENTATIONS AND BOUNDING NOISE TERMS

Before starting the proof, we again remark that, for simplicity, we assume that the mini-batch size $b$ divides the number of components $n$ (namely, $q := n/b$ is a positive integer); readers who want to read the proofs for the fully stochastic case (i.e., $b = 1$) can substitute $n$ for every $q$. Also, there is no problem in treating any fraction with a positive numerator and a zero denominator as $+\infty$. Moreover, we simply regard $(q-1)/(n-1) = 1$ when $n = 1$.

We start the proof by aggregating all updates throughout an epoch to obtain an "epoch-wise" update equation. The reason is that the sampled components in each epoch depend on each other, so it is much harder to deal with each iteration individually. At iteration $t \in [n/b] = [q]$ of epoch $k \in [K]$, we use a mini-batch $B_t^k := \{\sigma_k(j) : b(t-1) < j \le bt,\ j \in [n]\}$. To ease the analysis of Algorithm 2, define the following averages of (partial) gradient oracles at a point $z = (x; y)$ over the mini-batch:

$$g_t^k(z) := \frac{1}{b}\sum_{i \in B_t^k} \nabla_1 f_i(z), \qquad h_t^k(z) := \frac{1}{b}\sum_{i \in B_t^k} \nabla_2 f_i(z).$$

By Assumption 1, $g_t^k$ and $h_t^k$ are $L$-Lipschitz continuous. Averaging them over the iterates of a whole epoch $(z_0^k, \dots, z_{q-1}^k)$, we define

$$g^k := \frac{1}{q}\sum_{t=1}^q g_t^k(z_{t-1}^k), \qquad h^k := \frac{1}{q}\sum_{t=1}^q h_t^k(z_{t-1}^k).$$

Then, by summing up the updates in epoch $k$, we can summarize the epoch as follows:

$$x_0^{k+1} = x_0^k - q\alpha\, g^k, \qquad y_0^{k+1} = y_0^k + q\beta\, h^k. \qquad \text{(simSGDA-RR)}$$

We may assume that the intermediate iterates $z_1^k, \dots, z_q^k$ stay close to the starting iterate $z_0^k$ of an epoch $k$, which results from, e.g., small step sizes. Then, we can approximate the aggregated epoch of SGDA-RR as a step of simGDA applied to $f = \frac{1}{n}\sum_{i=1}^n f_i$: $g^k \approx \nabla_1 f(z_0^k)$, $h^k \approx \nabla_2 f(z_0^k)$. In other words,

$$x_0^{k+1} \approx x_0^k - q\alpha \nabla_1 f(z_0^k), \qquad y_0^{k+1} \approx y_0^k + q\beta \nabla_2 f(z_0^k).$$

With Assumptions 1, 3, and 4, we can obtain a naive (but complicated) upper bound on the gap $V_\lambda(z_0^{k+1}) - V_\lambda(z_0^k)$, applying only the smoothness of $\Phi$ and $-f$, without any assumptions on the step sizes.

Lemma 16. Suppose that Assumptions 1, 3, and 4 hold. Let $\kappa_2 = L/\mu_2$, where $\mu_2$ is the PŁ constant of $-f(x; \cdot)$. Then, the mini-batch simSGDA-RR satisfies

$$\begin{aligned} V_\lambda(z_0^{k+1}) - V_\lambda(z_0^k) \le{}& -\frac{\lambda+1}{2}\, q\alpha \|\nabla\Phi(x_0^k)\|^2 + (\lambda+1)\, q\alpha \|\nabla\Phi(x_0^k) - \nabla_1 f(z_0^k)\|^2 \\ &+ \frac{q\alpha}{2}\|\nabla_1 f(z_0^k)\|^2 - \frac{q\beta}{2}\|\nabla_2 f(z_0^k)\|^2 \\ &+ \Big(\lambda + \frac{1}{2}\Big) q\alpha \|g^k - \nabla_1 f(z_0^k)\|^2 + \frac{q\beta}{2}\|h^k - \nabla_2 f(z_0^k)\|^2 \\ &- \Big[\lambda - \{(\lambda+1)(\kappa_2+1)+1\} Lq\alpha\Big]\frac{q\alpha}{2}\|g^k\|^2 - (1 - Lq\beta)\frac{q\beta}{2}\|h^k\|^2. \end{aligned} \tag{7}$$

Proof. By definition of $V_\lambda$, the following equation holds:

$$V_\lambda(z_0^{k+1}) - V_\lambda(z_0^k) = (\lambda+1)\left[\Phi(x_0^{k+1}) - \Phi(x_0^k)\right] + \left[f(z_0^k) - f(z_0^{k+1})\right]. \tag{8}$$

First, we seek an upper bound of $\Phi(x_0^{k+1}) - \Phi(x_0^k)$. By Proposition 9, $\Phi$ is $L(\kappa_2+1)$-smooth. Hence, we have

$$\begin{aligned} \Phi(x_0^{k+1}) - \Phi(x_0^k) &\le \langle \nabla\Phi(x_0^k),\, x_0^{k+1} - x_0^k\rangle + \frac{L(\kappa_2+1)}{2}\|x_0^{k+1} - x_0^k\|^2 \\ &= -q\alpha \langle \nabla\Phi(x_0^k),\, g^k\rangle + \frac{L(\kappa_2+1)}{2} q^2\alpha^2 \|g^k\|^2 \\ &= -\frac{q\alpha}{2}\left[\|\nabla\Phi(x_0^k)\|^2 + \|g^k\|^2 - \|\nabla\Phi(x_0^k) - g^k\|^2\right] + \frac{L(\kappa_2+1)}{2} q^2\alpha^2 \|g^k\|^2 \\ &= -\frac{q\alpha}{2}\|\nabla\Phi(x_0^k)\|^2 + \frac{q\alpha}{2}\|\nabla\Phi(x_0^k) - g^k\|^2 - \frac{q\alpha}{2}\left(1 - L(\kappa_2+1)q\alpha\right)\|g^k\|^2 \\ &\le -\frac{q\alpha}{2}\|\nabla\Phi(x_0^k)\|^2 + q\alpha\|\nabla\Phi(x_0^k) - \nabla_1 f(z_0^k)\|^2 + q\alpha\|g^k - \nabla_1 f(z_0^k)\|^2 - \frac{q\alpha}{2}\left(1 - L(\kappa_2+1)q\alpha\right)\|g^k\|^2. \end{aligned} \tag{9}$$

The third line is due to the polarization identity ($2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$) and the last inequality applies Young's inequality ($\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$). Next, by Assumption 1, the $L$-smoothness of $-f(\cdot\,; \cdot)$ yields an upper bound of $f(z_0^k) - f(z_0^{k+1})$:

$$\begin{aligned} f(z_0^k) - f(z_0^{k+1}) &\le -\langle \nabla f(z_0^k),\, z_0^{k+1} - z_0^k\rangle + \frac{L}{2}\|z_0^{k+1} - z_0^k\|^2 \\ &= -\langle \nabla_1 f(z_0^k),\, x_0^{k+1} - x_0^k\rangle - \langle \nabla_2 f(z_0^k),\, y_0^{k+1} - y_0^k\rangle + \frac{L}{2}\|x_0^{k+1} - x_0^k\|^2 + \frac{L}{2}\|y_0^{k+1} - y_0^k\|^2 \\ &= q\alpha\langle \nabla_1 f(z_0^k),\, g^k\rangle - q\beta\langle \nabla_2 f(z_0^k),\, h^k\rangle + \frac{L}{2}q^2\alpha^2\|g^k\|^2 + \frac{L}{2}q^2\beta^2\|h^k\|^2 \\ &= \frac{q\alpha}{2}\|\nabla_1 f(z_0^k)\|^2 - \frac{q\alpha}{2}\|g^k - \nabla_1 f(z_0^k)\|^2 + \frac{q\alpha}{2}(1 + Lq\alpha)\|g^k\|^2 \\ &\quad - \frac{q\beta}{2}\|\nabla_2 f(z_0^k)\|^2 + \frac{q\beta}{2}\|h^k - \nabla_2 f(z_0^k)\|^2 - \frac{q\beta}{2}(1 - Lq\beta)\|h^k\|^2. \end{aligned} \tag{10}$$

The last equality is due to the polarization identity. Lastly, substituting (9) and (10) into (8) finishes the proof.

We remark that the last two terms of the inequality (7) can simply be ignored by choosing small enough step sizes. However, the terms in the third line of (7) are non-negative terms related to the "noise" of the approximations $g^k \approx \nabla_1 f(z_0^k)$, $h^k \approx \nabla_2 f(z_0^k)$. Hence, it is important to control the noise terms $\|g^k - \nabla_1 f(z_0^k)\|^2$ and $\|h^k - \nabla_2 f(z_0^k)\|^2$ to guarantee a fast decrease of $V_\lambda(z_0^k)$.

Lemma 17. For mini-batch simSGDA-RR, define

$$G^k := \frac{1}{q}\sum_{t=1}^q \|z_{t-1}^k - z_0^k\|^2. \tag{11}$$

Then, under Assumption 1, $\|g^k - \nabla_1 f(z_0^k)\|^2 \le L^2 G^k$ and $\|h^k - \nabla_2 f(z_0^k)\|^2 \le L^2 G^k$. As a side remark, $G^k = 0$ when $q = 1$ and, in particular, when $n = 1$.

Proof. Recall that $\frac{1}{q}\sum_{t=1}^q g_t^k(z) = \nabla_1 f(z)$ and $\frac{1}{q}\sum_{t=1}^q h_t^k(z) = \nabla_2 f(z)$. By Jensen's inequality and Lipschitz continuity,

$$\|g^k - \nabla_1 f(z_0^k)\|^2 = \Big\|\frac{1}{q}\sum_{t=1}^q \left[g_t^k(z_{t-1}^k) - g_t^k(z_0^k)\right]\Big\|^2 \le \frac{1}{q}\sum_{t=1}^q \|g_t^k(z_{t-1}^k) - g_t^k(z_0^k)\|^2 \le \frac{L^2}{q}\sum_{t=1}^q \|z_{t-1}^k - z_0^k\|^2.$$

Similarly, $\|h^k - \nabla_2 f(z_0^k)\|^2 \le \frac{L^2}{q}\sum_{t=1}^q \|z_{t-1}^k - z_0^k\|^2$. This concludes the proof.

Thanks to the lemma, it suffices to bound the term $G^k$. Notice that it also represents how far the intermediate iterates $z_t^k$ are from the pivot $z_0^k$ on average. Before moving on, we define an algorithm-specific symbol denoting a conditional expectation.

Definition 3.
We denote a conditional expectation of a random variable X given all iterates of the first k - 1 epochs by E k [X] = E[X|z 1 0 , z 1 1 , . . . , z k-1 n ]. In particular, if k = 1, it boils down to a conditional expectation given only the initial iterate z 1 0 . We get an upper bound of a (conditional) expectation E k [G k ] in the following lemma, which extends a lemma of Nguyen et al. (2021, Lemma 6) to our minimax problems. Lemma 18. Suppose that Assumptions 1 and 2 hold. Assume that the permutation σ k is sampled uniformly at random from S n . Then, for any step sizes α, β satisfying α 2 + β 2 ≤ 1 3q(q-1)L 2 , the iterates {z k t } q-1 t=0 of the k-th epoch of mini-batch simSGDA-RR satisfies (for n > 1) E k G k ≤ 2 q 2 + q(q -1) n -1 A α 2 ∇ 1 f (z k 0 ) 2 + β 2 ∇ 2 f (z k 0 ) 2 + 2q(q -1) n -1 (α 2 + β 2 )B. Proof. Note that G k = 0 when q = 1 by its definition. From now, we may assume q > 1 and n > 1 in this proof. By summing the first t ∈ [q -1] updates of the k-th epoch of mini-batch simSGDA-RR, we have x k t = x k 0 -tα   1 t t j=1 g k j (z k j-1 )   , y k t = y k 0 + tβ   1 t t j=1 h k j (z k j-1 )   . 11 For any n vectors a1, • • • , an, 1 n n j=1 aj 2 ≤ 1 n n j=1 aj 2 . Then we can bound the following squared distance. x k t -x k 0 2 = α 2 t 2 1 t t j=1 g k j (z k j-1 ) 2 ≤ 3α 2 t 2    1 t t j=1 g k j (z k j-1 ) -g k j (z k 0 ) 2 + 1 t t j=1 g k j (z k 0 ) -∇ 1 f (z k 0 ) 2 + ∇ 1 f (z k 0 ) 2    ≤ 3α 2 t t j=1 g k j (z k j-1 ) -g k j (z k 0 ) 2 + 3α 2 t 2    1 t t j=1 g k j (z k 0 ) -∇ 1 f (z k 0 ) 2 + ∇ 1 f (z k 0 ) 2    ≤ 3α 2 L 2 t • t j=1 z k j-1 -z k 0 2 + 3α 2 t 2    1 t t j=1 g k j (z k 0 ) -∇ 1 f (z k 0 ) 2 + ∇ 1 f (z k 0 ) 2    ≤ 3α 2 L 2 t • qG k + 3α 2 t 2    1 t t j=1 g k j (z k 0 ) -∇ 1 f (z k 0 ) 2 + ∇ 1 f (z k 0 ) 2    . ( ) The second and third lines are due to Jensen's inequality. The fourth line is due to L-Lipschitz continuity of g k j . 
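Likewise,
$$\|y_t^k - y_0^k\|^2 \le 3\beta^2 L^2 t \cdot qG_k + 3\beta^2 t^2 \Big\{ \Big\|\frac{1}{t}\sum_{j=1}^{t} h_j^k(z_0^k) - \nabla_2 f(z_0^k)\Big\|^2 + \|\nabla_2 f(z_0^k)\|^2 \Big\}. \tag{13}$$
Summing up (12) and (13),
$$\begin{aligned}
\|z_t^k - z_0^k\|^2 &= \|x_t^k - x_0^k\|^2 + \|y_t^k - y_0^k\|^2 \\
&\le 3(\alpha^2+\beta^2)L^2 tq\,G_k + 3\alpha^2 t^2 \Big\{ \Big\|\frac{1}{t}\sum_{j=1}^{t} g_j^k(z_0^k) - \nabla_1 f(z_0^k)\Big\|^2 + \|\nabla_1 f(z_0^k)\|^2 \Big\} \\
&\quad + 3\beta^2 t^2 \Big\{ \Big\|\frac{1}{t}\sum_{j=1}^{t} h_j^k(z_0^k) - \nabla_2 f(z_0^k)\Big\|^2 + \|\nabla_2 f(z_0^k)\|^2 \Big\}.
\end{aligned} \tag{14}$$
Taking the conditional expectation $\mathbb{E}_k$ (given $z_0^k$) of inequality (14),
$$\begin{aligned}
\mathbb{E}_k\|z_t^k - z_0^k\|^2 &\overset{(14)}{\le} 3(\alpha^2+\beta^2)L^2 tq\,\mathbb{E}_k[G_k] + 3\alpha^2 t^2\|\nabla_1 f(z_0^k)\|^2 + 3\beta^2 t^2\|\nabla_2 f(z_0^k)\|^2 \\
&\quad + 3\alpha^2 t^2\,\mathbb{E}_k\Big\|\frac{1}{t}\sum_{j=1}^{t} g_j^k(z_0^k) - \nabla_1 f(z_0^k)\Big\|^2 + 3\beta^2 t^2\,\mathbb{E}_k\Big\|\frac{1}{t}\sum_{j=1}^{t} h_j^k(z_0^k) - \nabla_2 f(z_0^k)\Big\|^2.
\end{aligned} \tag{15}$$
Here we take advantage of without-replacement sampling. Putting $\nabla_s f_i(z_0^k) \to v_i$ ($s \in \{1, 2\}$), one obtains a correspondence between the quantities that arise from our algorithm and the symbols in Appendix B.2:

for $s = 1$ ($\nabla_1 f_i(z_0^k) \to v_i$): $m = \nabla_1 f(z_0^k)$, $\tau^2 \le A\|\nabla_1 f(z_0^k)\|^2 + B$, $w_t = g_t^k(z_0^k)$, $m_t = \frac{1}{t}\sum_{j=1}^{t} g_j^k(z_0^k)$;

for $s = 2$ ($\nabla_2 f_i(z_0^k) \to v_i$): $m = \nabla_2 f(z_0^k)$, $\tau^2 \le A\|\nabla_2 f(z_0^k)\|^2 + B$, $w_t = h_t^k(z_0^k)$, $m_t = \frac{1}{t}\sum_{j=1}^{t} h_j^k(z_0^k)$.

The upper bounds on the $\tau^2$'s come from Assumption 2. Then by Proposition 14, for any $t \le q$,
$$\begin{aligned}
t^2\,\mathbb{E}_k\Big\|\frac{1}{t}\sum_{j=1}^{t} g_j^k(z_0^k) - \nabla_1 f(z_0^k)\Big\|^2 &\le \frac{t(q-t)}{n-1}\left(A\|\nabla_1 f(z_0^k)\|^2 + B\right), \\
t^2\,\mathbb{E}_k\Big\|\frac{1}{t}\sum_{j=1}^{t} h_j^k(z_0^k) - \nabla_2 f(z_0^k)\Big\|^2 &\le \frac{t(q-t)}{n-1}\left(A\|\nabla_2 f(z_0^k)\|^2 + B\right).
\end{aligned}$$
Putting these into the inequality (15),
$$\mathbb{E}_k\|z_t^k - z_0^k\|^2 \le 3(\alpha^2+\beta^2)\left\{L^2 tq\,\mathbb{E}_k[G_k] + \frac{t(q-t)}{n-1}B\right\} + 3\left(\alpha^2\|\nabla_1 f(z_0^k)\|^2 + \beta^2\|\nabla_2 f(z_0^k)\|^2\right)\left\{t^2 + \frac{t(q-t)}{n-1}A\right\}.$$
Taking an average of the inequality above over $0 \le t \le q-1$,
$$\begin{aligned}
\mathbb{E}_k[G_k] = \frac{1}{q}\sum_{t=0}^{q-1}\mathbb{E}_k\|z_t^k - z_0^k\|^2 &\le \frac{3q(q-1)}{2}(\alpha^2+\beta^2)L^2\,\mathbb{E}_k[G_k] + (\alpha^2+\beta^2)\frac{q^2-1}{2(n-1)}B \\
&\quad + \left(\alpha^2\|\nabla_1 f(z_0^k)\|^2 + \beta^2\|\nabla_2 f(z_0^k)\|^2\right)\left\{\frac{(q-1)(2q-1)}{2} + \frac{q^2-1}{2(n-1)}A\right\},
\end{aligned} \tag{16}$$
where we used the facts $\sum_{t=0}^{q-1} t = \frac{q(q-1)}{2}$, $\frac{1}{q}\sum_{t=0}^{q-1} t^2 = \frac{(q-1)(2q-1)}{6}$, and $\frac{1}{q}\sum_{t=0}^{q-1}\frac{t(q-t)}{n-1} = \frac{q^2-1}{6(n-1)}$. Since we assumed $\alpha^2+\beta^2 \le \frac{1}{3q(q-1)L^2}$, we have $1 \le 2\left(1 - \frac{3q(q-1)L^2}{2}(\alpha^2+\beta^2)\right)$. Using this,
$$\begin{aligned}
\mathbb{E}_k[G_k] &\le 2\left(1 - \frac{3q(q-1)L^2}{2}(\alpha^2+\beta^2)\right)\mathbb{E}_k[G_k] \\
&\overset{(16)}{\le} \left\{(q-1)(2q-1) + \frac{q^2-1}{n-1}A\right\}\left(\alpha^2\|\nabla_1 f(z_0^k)\|^2 + \beta^2\|\nabla_2 f(z_0^k)\|^2\right) + \frac{q^2-1}{n-1}(\alpha^2+\beta^2)B \\
&\le 2\left(q^2 + \frac{q(q-1)}{n-1}A\right)\left(\alpha^2\|\nabla_1 f(z_0^k)\|^2 + \beta^2\|\nabla_2 f(z_0^k)\|^2\right) + \frac{2q(q-1)}{n-1}(\alpha^2+\beta^2)B,
\end{aligned}$$
where the last inequality used $(q-1)(2q-1) \le 2q^2$ and $q+1 \le 2q$ for $q \ge 1$.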
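The without-replacement step of the proof above can be sanity-checked on a toy instance. The sketch below assumes mini-batch size $b = 1$ (so $q = n$) and scalar vectors $v_i$, and it assumes that Proposition 14, which is not restated in this part of the appendix, reduces in that setting to the classical identity $t^2\,\mathbb{E}(m_t - m)^2 = \frac{t(n-t)}{n-1}\tau^2$ with $\tau^2 = \frac{1}{n}\sum_i (v_i - m)^2$. The function names and the data are illustrative, not taken from the paper.

```python
from fractions import Fraction
from itertools import permutations


def rr_scaled_deviation(values, t):
    """Exact t^2 * E[(m_t - m)^2], averaged over all orderings of `values`."""
    n = len(values)
    mean = Fraction(sum(values), n)
    total, count = Fraction(0), 0
    for perm in permutations(values):
        partial_mean = Fraction(sum(perm[:t]), t)  # mean of the first t draws
        total += t * t * (partial_mean - mean) ** 2
        count += 1
    return total / count


def predicted_scaled_deviation(values, t):
    """t (n - t) / (n - 1) * tau^2: the assumed b = 1 form of the bound."""
    n = len(values)
    mean = Fraction(sum(values), n)
    tau_sq = sum((Fraction(v) - mean) ** 2 for v in values) / n
    return Fraction(t * (n - t), n - 1) * tau_sq


values = [3, -1, 4, 1, -5]  # arbitrary toy data (hypothetical)
for t in range(1, len(values) + 1):
    assert rr_scaled_deviation(values, t) == predicted_scaled_deviation(values, t)
```

Exact rational arithmetic is used so that the comparison is an equality check rather than a tolerance check; the deviation vanishes at $t = n$, which is the source of the $(q - t)$ factor.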

C.3 RECURRENCE INEQUALITIES FOR GENERAL SMOOTH NONCONVEX-PŁ OBJECTIVE

Subsequently, we obtain recurrence inequalities for the (expected) potential function $\mathbb{E}_k[V_\lambda(z_0^k)]$ for nonconvex-PŁ problems. Since primal-PŁ-PŁ problems form a subclass of nonconvex-PŁ problems, these recurrence relations serve as stepping stones for all of our convergence rates. We introduce small-step-size assumptions that let us discard a few troublesome terms from our bound. On top of that, combining the PŁ condition (Assumption 4) with Lemmas 16, 17, and 18, we obtain a much more concise bound on the expected per-epoch change of $V_\lambda$. This simple recurrence inequality is the key to proving our convergence bounds.

Lemma 19. Suppose that Assumptions 1, 2, 3, and 4 hold. Assume that the step sizes $\alpha$ and $\beta$ satisfy
$$\alpha \le \frac{\lambda}{\{(\lambda+1)(\kappa_2+1)+1\}\,qL}, \qquad \beta \le \frac{1}{qL}, \qquad \alpha^2+\beta^2 \le \frac{1}{3q(q-1)L^2}, \tag{17}$$
and the condition
$$C_0 := q\beta - 2L^2 q\left(q^2 + \frac{q(q-1)}{n-1}A\right)\big((2\lambda+1)\alpha+\beta\big)\beta^2 \ge 0$$
as well. Then, the iterates of mini-batch simSGDA-RR satisfy
$$\mathbb{E}_k[V_\lambda(z_0^{k+1})] - V_\lambda(z_0^k) \le -C_1\|\nabla\Phi(x_0^k)\|^2 - C_2\big(\Phi(x_0^k) - f(z_0^k)\big) + C_3,$$
where
$$\begin{aligned}
C_1 &= \frac{\lambda-1}{2}q\alpha - 2L^2 q\left(q^2 + \frac{q(q-1)}{n-1}A\right)\big((2\lambda+1)\alpha+\beta\big)\alpha^2, \\
C_2 &= \mu_2 C_0 - 2(\lambda+2)L\kappa_2 q\alpha - 4L^3\kappa_2 q\left(q^2 + \frac{q(q-1)}{n-1}A\right)\big((2\lambda+1)\alpha+\beta\big)\alpha^2 \\
&= \mu_2 q\beta - 2(\lambda+2)L\kappa_2 q\alpha - 2L^2\mu_2 q\left(q^2 + \frac{q(q-1)}{n-1}A\right)\big((2\lambda+1)\alpha+\beta\big)\big(2\kappa_2^2\alpha^2+\beta^2\big), \\
C_3 &= \frac{L^2 q^2(q-1)}{n-1}\big((2\lambda+1)\alpha+\beta\big)(\alpha^2+\beta^2)B.
\end{aligned}$$

Proof. The first two inequalities of (17) eliminate the last two terms on the right-hand side of the inequality in Lemma 16. In addition, applying Lemma 17 to Lemma 16, we have
$$\begin{aligned}
V_\lambda(z_0^{k+1}) - V_\lambda(z_0^k) &\le -\frac{\lambda+1}{2}q\alpha\|\nabla\Phi(x_0^k)\|^2 + (\lambda+1)q\alpha\|\nabla\Phi(x_0^k) - \nabla_1 f(z_0^k)\|^2 \\
&\quad + \frac{q\alpha}{2}\|\nabla_1 f(z_0^k)\|^2 - \frac{q\beta}{2}\|\nabla_2 f(z_0^k)\|^2 + \frac{(2\lambda+1)\alpha+\beta}{2}qL^2 G_k.
\end{aligned} \tag{18}$$
If we take the conditional expectation $\mathbb{E}_k$ of (18) and apply Lemma 18 (which requires the third inequality of (17) to hold), we get
$$\begin{aligned}
\mathbb{E}_k[V_\lambda(z_0^{k+1})] - V_\lambda(z_0^k) &\le -\frac{\lambda+1}{2}q\alpha\|\nabla\Phi(x_0^k)\|^2 + (\lambda+1)q\alpha\|\nabla\Phi(x_0^k) - \nabla_1 f(z_0^k)\|^2 \\
&\quad + \frac{1}{2}\left\{q\alpha + 2L^2 q\left(q^2 + \frac{q(q-1)}{n-1}A\right)\big((2\lambda+1)\alpha+\beta\big)\alpha^2\right\}\|\nabla_1 f(z_0^k)\|^2 \\
&\quad - \frac{1}{2}\underbrace{\left\{q\beta - 2L^2 q\left(q^2 + \frac{q(q-1)}{n-1}A\right)\big((2\lambda+1)\alpha+\beta\big)\beta^2\right\}}_{C_0}\|\nabla_2 f(z_0^k)\|^2 \\
&\quad + \underbrace{\frac{L^2 q^2(q-1)}{n-1}\big((2\lambda+1)\alpha+\beta\big)(\alpha^2+\beta^2)B}_{C_3}.
\end{aligned} \tag{19}$$
It remains to bound the terms in (19) using the tools developed so far. First, recall that $\Phi(x) := \max_{y' \in \mathcal{Y}} f(x; y')$. Since $-f(x; y)$ is $\mu_2$-PŁ in $y$, we have
$$-\|\nabla_2 f(z_0^k)\|^2 \le -2\mu_2\big(\Phi(x_0^k) - f(z_0^k)\big). \tag{20}$$
Given any $x$, $\nabla\Phi(x) = \nabla_1 f(x; y^*(x))$ for any $y^*(x) \in \arg\max_{y' \in \mathcal{Y}} f(x; y')$ by Proposition 9. Besides, $-f(x; \cdot)$ satisfies the QG condition with constant $\mu_2$ by Proposition 7. Thus, by choosing $y^*(x_0^k)$ to be the projection of $y_0^k$ onto $\arg\max_{y' \in \mathcal{Y}} f(x_0^k; y')$,
$$\|\nabla\Phi(x_0^k) - \nabla_1 f(z_0^k)\|^2 \le L^2\|y^*(x_0^k) - y_0^k\|^2 \le 2L\kappa_2\big(\Phi(x_0^k) - f(z_0^k)\big). \tag{21}$$
Here, the first inequality applies the $L$-Lipschitz continuity of $\nabla_1 f(x_0^k; \cdot)$, implied by Assumption 1. On top of that, applying Young's inequality to the term $\|\nabla_1 f(z_0^k)\|^2$,
$$\|\nabla_1 f(z_0^k)\|^2 \le 2\|\nabla\Phi(x_0^k)\|^2 + 2\|\nabla\Phi(x_0^k) - \nabla_1 f(z_0^k)\|^2 \overset{(21)}{\le} 2\|\nabla\Phi(x_0^k)\|^2 + 4L\kappa_2\big(\Phi(x_0^k) - f(z_0^k)\big). \tag{22}$$
By applying inequalities (20), (21), and (22) to the bound (19), we conclude the proof.

In Lemma 19, we saw that if the step sizes satisfy certain conditions, then the per-epoch progress simplifies a great deal. It remains to choose appropriate step sizes and parameters (e.g., $\lambda$) so that $\alpha$ and $\beta$ meet the small-step-size conditions (17) and the constants $C_0$, $C_1$, $C_2$, and $C_3$ are positive.

Lemma 20. Suppose that Assumptions 1, 2, 3 and 4 hold. Let $\lambda = 4$ and assume that
$$0 < \beta \le \frac{1}{6L\sqrt{q^2 + \frac{q(q-1)}{n-1}A}}, \qquad \alpha = \frac{\beta}{r}, \quad \text{where } r \ge 14\kappa_2^2.$$
Then these satisfy all the inequalities (17), and the terms defined in Lemma 19 satisfy $C_0 > 0$, $C_1 > q\alpha$, $C_2 > L\kappa_2 q\alpha/2$, and $C_3 \ge 0$. Consequently, due to the recurrence inequality in Lemma 19, mini-batch simSGDA-RR satisfies, for some numerical constant $c > 0$,
$$\mathbb{E}_k[V_\lambda(z_0^{k+1})] - V_\lambda(z_0^k) \le -q\alpha\|\nabla\Phi(x_0^k)\|^2 - (L\kappa_2 q\alpha/2)\big(\Phi(x_0^k) - f(z_0^k)\big) + (cr)^3\frac{L^2 q^2(q-1)}{n-1}B\alpha^3.$$
Please note that we mark the recurrence inequality above with a special symbol ( ) because this inequality is the exact point where the proofs of Theorems 4 and 5 start to deviate.

Proof. Regardless of $A \ge 0$, we have
$$\beta \le \frac{1}{6Lq} \quad \text{and} \quad \alpha \le \frac{1}{6Lqr} \le \frac{1}{84L\kappa_2^2 q}. \tag{23}$$
This is enough to guarantee that the inequalities (17) hold with $\lambda = 4$. Since $C_0 > C_2/\mu_2$, it is enough to show $C_2 > 0$ to prove that $C_0 > 0$. Applying $\lambda = 4$, $\kappa_2 \ge 1$, and $\beta/\alpha = r \ge 14\kappa_2^2$,
$$\frac{C_1}{q\alpha} = \frac{3}{2} - 2L^2\left(q^2 + \frac{q(q-1)}{n-1}A\right)(9+r)\alpha^2 \ge \frac{3}{2} - \frac{2}{6^2}\cdot\frac{9+r}{r^2} \ge \frac{3}{2} - \frac{2\cdot 23}{6^2\cdot 14^2} > 1,$$
$$\begin{aligned}
\frac{C_2}{\mu_2 q\beta} &= 1 - \frac{12\kappa_2^2}{r} - 2L^2\left(q^2 + \frac{q(q-1)}{n-1}A\right)\left(\frac{9}{r}+1\right)\left(\frac{2\kappa_2^2}{r^2}+1\right)\beta^2 \\
&\ge 1 - \frac{12}{14} - \frac{2}{6^2}\left(\frac{9}{14\kappa_2^2}+1\right)\left(\frac{2}{14^2\kappa_2^2}+1\right) \ge \frac{2}{14} - \frac{2\cdot 23\cdot 198}{6^2\cdot 14^3} > \frac{1}{2\cdot 14}.
\end{aligned}$$
Thus, $C_1 > q\alpha$ and
$$C_2 > \frac{\mu_2 q\beta}{2\cdot 14} = \frac{\mu_2 qr\alpha}{2\cdot 14} \ge L\kappa_2 q\alpha/2.$$
We conclude the proof by bounding the term $C_3$. It is immediate from the definition that $C_3 \ge 0$. We can upper-bound $C_3$ by
$$C_3 = \frac{L^2 q^2(q-1)}{n-1}(9+r)(1+r^2)B\alpha^3 \le (cr)^3\frac{L^2 q^2(q-1)}{n-1}B\alpha^3,$$
for some numerical constant $c > 0$.
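The constants of Lemma 19 and the step-size choice of Lemma 20 can be evaluated numerically as a sanity check. The sketch below is a direct transcription of the formulas as read from this appendix (with $\lambda = 4$, the largest admissible $\beta$, and $r = 14\kappa_2^2$); the function names and the sample problem parameters are hypothetical.

```python
import math


def lemma19_constants(alpha, beta, L, mu2, q, n, A, B, lam=4.0):
    """Return (C0, C1, C2, C3) as defined in Lemma 19 (requires n > 1)."""
    kappa2 = L / mu2
    S = q**2 + q * (q - 1) / (n - 1) * A       # q^2 + q(q-1)/(n-1) * A
    w = (2 * lam + 1) * alpha + beta           # (2*lambda + 1)*alpha + beta
    C0 = q * beta - 2 * L**2 * q * S * w * beta**2
    C1 = (lam - 1) / 2 * q * alpha - 2 * L**2 * q * S * w * alpha**2
    C2 = (mu2 * C0 - 2 * (lam + 2) * L * kappa2 * q * alpha
          - 4 * L**3 * kappa2 * q * S * w * alpha**2)
    C3 = L**2 * q**2 * (q - 1) / (n - 1) * w * (alpha**2 + beta**2) * B
    return C0, C1, C2, C3


def lemma20_step_sizes(L, mu2, q, n, A):
    """Largest beta allowed by Lemma 20 and alpha = beta / r, r = 14*kappa2^2."""
    kappa2 = L / mu2
    S = q**2 + q * (q - 1) / (n - 1) * A
    beta = 1 / (6 * L * math.sqrt(S))
    return beta / (14 * kappa2**2), beta


# Hypothetical problem parameters, chosen only for illustration.
L, mu2, q, n, A, B = 1.0, 0.5, 10, 100, 1.0, 1.0
alpha, beta = lemma20_step_sizes(L, mu2, q, n, A)
C0, C1, C2, C3 = lemma19_constants(alpha, beta, L, mu2, q, n, A, B)
assert C0 > 0 and C1 > q * alpha and C2 > L * (L / mu2) * q * alpha / 2 and C3 >= 0
```

Larger ratios $r$ only make $\alpha$ smaller, so checking the boundary value $r = 14\kappa_2^2$ is the tightest case for the $C_2$ bound.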

C.4 CONVERGENCE RATES FOR SMOOTH NONCONVEX-PŁ PROBLEM

In this subsection, we show the convergence bound of general smooth nonconvex-PŁ problems in terms of $\min_{k \in [K]} \mathbb{E}\|\nabla\Phi(x_0^k)\|^2$. In the inequality ( ) of Lemma 20, we can simply ignore the second term $-(L\kappa_2 q\alpha/2)\big(\Phi(x_0^k) - f(z_0^k)\big) \le 0$ on the right-hand side, because $\Phi(x) \ge f(x; y)$ for any $(x; y)$. In other words, we may work with the inequality
$$\mathbb{E}_k[V_\lambda(z_0^{k+1})] - V_\lambda(z_0^k) \le -q\alpha\|\nabla\Phi(x_0^k)\|^2 + (cr)^3\frac{L^2 q^2(q-1)}{n-1}B\alpha^3. \tag{nc-PŁ}$$
Plugging in $q = n/b$, we obtain the convergence rate of Theorem 4. (Recall that $b$ is the size of mini-batches.)

Theorem 21 (Equivalent to Theorem 4, for simSGDA-RR). Suppose that $f$ satisfies Assumptions 1, 2, 3, and 4. Let $\lambda = 4$. Choose the step sizes $\alpha$ and $\beta$ by $\alpha = \beta/r$ for some $r \ge 14\kappa_2^2$ and
$$\beta = \min\left\{\frac{1}{6L\sqrt{q^2 + \frac{q(q-1)}{n-1}A}},\ \frac{1}{c}\left(\frac{V_\lambda(z_0^1)}{L^2 q^2\left(\frac{q-1}{n-1}\right)BK}\right)^{1/3}\right\},$$
for some numerical constant $c > 0$. Then, mini-batch simSGDA-RR satisfies
$$\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\|\nabla\Phi(x_0^k)\|^2 \le \frac{6rLV_\lambda(z_0^1)}{K}\sqrt{1 + \frac{q-1}{n-1}\cdot\frac{A}{q}} + 2cr\left(\frac{L^2 B\,V_\lambda(z_0^1)^2}{qK^2}\cdot\frac{q-1}{n-1}\right)^{1/3}.$$

Proof. To replace the conditional expectations with unconditional expectations, we take expectations of both sides of the inequality (nc-PŁ):
$$\mathbb{E}[V_\lambda(z_0^{k+1}) - V_\lambda(z_0^k)] \le -q\alpha\,\mathbb{E}\|\nabla\Phi(x_0^k)\|^2 + (cr)^3\frac{L^2 q^2(q-1)}{n-1}B\alpha^3.$$
Rearranging the terms and summing from $k = 1$ to $k = K$, we have
$$q\alpha\sum_{k=1}^{K}\mathbb{E}\|\nabla\Phi(x_0^k)\|^2 \le \mathbb{E}[V_\lambda(z_0^1) - V_\lambda(z_0^{K+1})] + (cr)^3\frac{L^2 q^2(q-1)}{n-1}B\alpha^3 K.$$
Dividing both sides by $qK\alpha$ and using the fact that $V_\lambda$ is non-negative, we get
$$\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\|\nabla\Phi(x_0^k)\|^2 \le \frac{V_\lambda(z_0^1)}{qK\alpha} + (cr)^3\frac{L^2 q(q-1)}{n-1}B\alpha^2.$$
Since our choice of step sizes implies
$$\alpha = \min\left\{\frac{1}{6rL\sqrt{q^2 + \frac{q(q-1)}{n-1}A}},\ \frac{1}{cr}\left(\frac{V_\lambda(z_0^1)}{L^2 Bq^2\left(\frac{q-1}{n-1}\right)K}\right)^{1/3}\right\},$$
we prove the theorem by using the inequality $\max\{a, b\} \le a + b$ (for $a, b \ge 0$) to bound $1/\alpha$.
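The last step of the proof of Theorem 21 can be checked numerically: with the stated step size, the two-term quantity $\frac{V_\lambda(z_0^1)}{qK\alpha} + (cr)^3\frac{L^2 q(q-1)}{n-1}B\alpha^2$ should not exceed the rate in the theorem. The sketch below assumes $n > 1$, $q > 1$, and $B > 0$; the constant `c` is the unspecified numerical constant of the theorem, so its value here is an arbitrary assumption, as are the function names and sample parameters.

```python
import math


def theorem21_step_sizes(V0, L, q, n, A, B, K, r, c):
    """Return (alpha, beta) following the min{...} rule of Theorem 21."""
    rho = (q - 1) / (n - 1)
    cap = 1 / (6 * L * math.sqrt(q**2 + q * rho * A))
    noise_limited = (V0 / (L**2 * q**2 * rho * B * K)) ** (1 / 3) / c
    beta = min(cap, noise_limited)
    return beta / r, beta


def averaged_gradient_bound(V0, L, q, n, B, K, r, c, alpha):
    """V0/(q K alpha) + (c r)^3 L^2 q (q-1)/(n-1) B alpha^2, as in the proof."""
    rho = (q - 1) / (n - 1)
    return V0 / (q * K * alpha) + (c * r) ** 3 * L**2 * q * rho * B * alpha**2


def theorem21_rate(V0, L, q, n, A, B, K, r, c):
    """Right-hand side of the bound stated in Theorem 21."""
    rho = (q - 1) / (n - 1)
    first = 6 * r * L * V0 / K * math.sqrt(1 + rho * A / q)
    second = 2 * c * r * (L**2 * B * V0**2 / (q * K**2) * rho) ** (1 / 3)
    return first + second


# Hypothetical parameters; c = 1.5 is an arbitrary stand-in for the constant.
params = dict(V0=5.0, L=2.0, q=8, n=64, B=3.0, r=14.0, c=1.5)
for K in (1, 10, 10_000):
    alpha, _ = theorem21_step_sizes(A=1.0, K=K, **params)
    lhs = averaged_gradient_bound(K=K, alpha=alpha, **params)
    assert lhs <= theorem21_rate(A=1.0, K=K, **params) * (1 + 1e-12)
```

For small `K` the first (deterministic) branch of the minimum is active, while for large `K` the noise-limited branch takes over, which is what produces the $K^{-2/3}$ term.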

C.5 CONVERGENCE RATES FOR SMOOTH PRIMAL-PŁ-PŁ PROBLEM

In this subsection, we prove the convergence bound of primal-PŁ-PŁ (or, PŁ(Φ)-PŁ) problems in terms of $\mathbb{E}[V_\lambda(z_0^{K+1})]$. Unlike the previous subsection, we additionally use Assumption 5, which states that $f(x; y)$ satisfies the primal PŁ condition; namely, the primal function $\Phi(x) = \max_{y'} f(x; y')$ is a $\mu_1$-PŁ function. With this assumption, we derive another recurrence inequality from the inequality ( ). We note that it uses the $\mu_1$-PŁ condition for $\Phi$ (∵ Proposition 10) but not necessarily for $f(\cdot; y)$.

Lemma 22. Suppose that $f$ satisfies Assumptions 1, 2, 3, 4, and 5. Then, with the same choice of $\lambda = 4$ and the same conditions on the step sizes $\alpha$ and $\beta$ as in Lemma 20, mini-batch simSGDA-RR satisfies, for some numerical constant $c > 0$,
$$\mathbb{E}_k[V_\lambda(z_0^{k+1})] \le (1 - \mu_1 q\alpha/2)\,V_\lambda(z_0^k) + (cr)^3\frac{L^2 q^2(q-1)}{n-1}B\alpha^3. \tag{PŁ(Φ)-PŁ}$$

Proof. Since the primal function $\Phi$ is a $\mu_1$-PŁ function, $-\|\nabla\Phi(x_0^k)\|^2 \le -2\mu_1\big(\Phi(x_0^k) - \Phi^*\big)$. Also, since $\mu_1 \le L$ and $\kappa_2 \ge 1$, we know that $-L\kappa_2 \le -\mu_1$. Applying these to the inequality ( ), we have
$$\begin{aligned}
\mathbb{E}_k[V_\lambda(z_0^{k+1})] - V_\lambda(z_0^k) &\le -(2\mu_1 q\alpha/\lambda)\cdot\lambda\big(\Phi(x_0^k) - \Phi^*\big) - (\mu_1 q\alpha/2)\big(\Phi(x_0^k) - f(z_0^k)\big) + (cr)^3\frac{L^2 q^2(q-1)}{n-1}B\alpha^3 \\
&= -(\mu_1 q\alpha/2)\cdot V_\lambda(z_0^k) + (cr)^3\frac{L^2 q^2(q-1)}{n-1}B\alpha^3,
\end{aligned}$$
since $\lambda = 4$. By rearranging the terms, we conclude the proof.

The multiplier $1 - \mu_1 q\alpha/2$ has a value between 0 and 1. To see why, note that from Equation (23),
$$0 < \mu_1 q\alpha/2 \le \mu_1 q\cdot\frac{1}{2\cdot 84L\kappa_2^2 q} = \frac{1}{168\kappa_1\kappa_2^2} < 1.$$

Theorem 23 (Equivalent to Theorem 5, for simSGDA-RR). Assume that $f$ satisfies Assumptions 1, 2, 3, 4, and 5. Let $\lambda = 4$. Choose the step sizes by $\alpha = \beta/r$ for some $r \ge 14\kappa_2^2$ and
$$\beta = \min\left\{\frac{1}{6L\sqrt{q^2 + \frac{q(q-1)}{n-1}A}},\ \frac{2r}{\mu_1 qK}\max\left\{1,\ \log\left(\frac{V_\lambda(z_0^1)\,\mu_1 qK^2}{8(cr)^3\kappa_1^2\left(\frac{q-1}{n-1}\right)B}\right)\right\}\right\},$$
for some numerical constant $c > 0$. Then, mini-batch simSGDA-RR satisfies
$$\mathbb{E}[V_\lambda(z_n^K)] \le O\left(V_\lambda(z_0^1)\cdot\exp\left(-\frac{K}{12\kappa_1 r\sqrt{1 + \frac{q-1}{n-1}\cdot\frac{A}{q}}}\right)\right) + \tilde{O}\left(\frac{\kappa_1^2 r^3 B}{\mu_1 qK^2}\cdot\frac{q-1}{n-1}\right).$$

Proof.
To replace the conditional expectations with unconditional expectations, we take expectations of both sides of the inequality (PŁ(Φ)-PŁ):
$$\mathbb{E}[V_\lambda(z_0^{k+1})] \le (1 - \mu_1 q\alpha/2)\,\mathbb{E}[V_\lambda(z_0^k)] + (cr)^3\frac{L^2 q^2(q-1)}{n-1}B\alpha^3.$$
Unrolling the recurrence inequality (Proposition 15) and using the relation $\beta = r\alpha$, we have
$$\begin{aligned}
\mathbb{E}[V_\lambda(z_n^K)] &\le (1 - \mu_1 q\alpha/2)^K V_\lambda(z_0^1) + \frac{2(cr)^3 L^2}{\mu_1 q\alpha}\cdot\frac{q^2(q-1)}{n-1}B\alpha^3 \\
&\le \exp(-\mu_1 qK\alpha/2)\,V_\lambda(z_0^1) + 2(cr)^3\mu_1\kappa_1^2\,\frac{q(q-1)}{n-1}B\alpha^2.
\end{aligned} \tag{24}$$
Note that the second term on the right-hand side of the inequality above is zero when $q = 1$. In that case, $\mathbb{E}[V_\lambda(z_0^k)]$ decays exponentially. Thus, we assume $q > 1$ hereafter.

Case 1: If $K$ is as large as
$$K > \frac{\kappa_1 r^{3/2}}{\sqrt{\mu_1}}\sqrt{\frac{8c^3 eB}{V_\lambda(z_0^1)\,q}\cdot\frac{q-1}{n-1}}, \qquad (e = \exp(1))$$
the step size $\alpha$ is
$$\alpha = \min\left\{\frac{1}{6Lr\sqrt{q^2 + \frac{q(q-1)}{n-1}A}},\ \frac{2}{\mu_1 qK}\log(♣)\right\}, \quad \text{where } ♣ = \frac{V_\lambda(z_0^1)\,\mu_1 qK^2}{8(cr)^3\kappa_1^2\left(\frac{q-1}{n-1}\right)B}.$$
Due to the lower bound on the number of epochs $K$, the fraction ♣ inside the log factor is greater than $e > 1$, which guarantees that the step size is positive. Putting this into the inequality (24) and using the fact that $\max\{a, b\} \le a + b$ (for $a, b \ge 0$), we have
$$\begin{aligned}
\mathbb{E}[V_\lambda(z_n^K)] &\le V_\lambda(z_0^1)\cdot\exp\left(-\frac{K}{12\kappa_1 r\sqrt{1 + \frac{q-1}{n-1}\cdot\frac{A}{q}}}\right) + \frac{2\cdot 8(cr)^3\kappa_1^2 B}{\mu_1 qK^2}\cdot\frac{q-1}{n-1}\big(1 + \log^2(♣)\big) \\
&= V_\lambda(z_0^1)\cdot\exp\left(-\frac{K}{12\kappa_1 r\sqrt{1 + \frac{q-1}{n-1}\cdot\frac{A}{q}}}\right) + \tilde{O}\left(\frac{\kappa_1^2 r^3 B}{\mu_1 qK^2}\cdot\frac{q-1}{n-1}\right).
\end{aligned}$$

Case 2: Otherwise, the log factor might be smaller than 1 (or even negative) when $K$ is too small. However, in this case, we have
$$V_\lambda(z_0^1) \le \frac{8(cr)^3 e\kappa_1^2 B}{\mu_1 qK^2}\cdot\frac{q-1}{n-1}; \qquad \alpha = \min\left\{\frac{1}{6Lr\sqrt{q^2 + \frac{q(q-1)}{n-1}A}},\ \frac{2}{\mu_1 qK}\right\}.$$
Putting these into the inequality (24), we have
$$\mathbb{E}[V_\lambda(z_n^K)] \le \frac{8(cr)^3 e\kappa_1^2 B}{\mu_1 qK^2}\cdot\frac{q-1}{n-1}\left\{\exp(-\mu_1 qK\alpha/2) + \frac{1}{e}\cdot(\mu_1 qK\alpha/2)^2\right\} \le \frac{8(cr)^3 e\kappa_1^2 B}{\mu_1 qK^2}\cdot\frac{q-1}{n-1} = O\left(\frac{\kappa_1^2 r^3 B}{\mu_1 qK^2}\cdot\frac{q-1}{n-1}\right).$$
The inequality in the last line is due to the fact that $e^{-t} + t^2/e \le 1$ for each $t \in (0, 1]$, and that $\mu_1 qK\alpha/2 \in (0, 1]$. Combining Case 1 and Case 2, we conclude the proof of the theorem.
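Two elementary facts used in the proof of Theorem 23 are easy to check numerically: the Case 2 inequality $e^{-t} + t^2/e \le 1$ on $(0, 1]$, and the unrolled form of a linear recurrence $V \leftarrow (1 - a)V + b$, which is bounded by $e^{-aK}V_0 + b/a$. In the sketch below, `a` stands for $\mu_1 q\alpha/2$ and `b` for the additive noise term; the function names and numeric values are illustrative assumptions.

```python
import math


def case2_factor(t):
    """exp(-t) + t^2 / e, the factor bounded by 1 in Case 2."""
    return math.exp(-t) + t * t / math.e


def unroll(V0, a, b, K):
    """Iterate V <- (1 - a) * V + b for K epochs (equality case of the recurrence)."""
    V = V0
    for _ in range(K):
        V = (1 - a) * V + b
    return V


# Check the Case 2 inequality on a grid of (0, 1].
assert max(case2_factor(i / 1000) for i in range(1, 1001)) <= 1.0

# Check the unrolled bound exp(-a K) V0 + b / a for sample (hypothetical) values.
V0, a, b, K = 3.0, 0.01, 1e-4, 500
assert unroll(V0, a, b, K) <= math.exp(-a * K) * V0 + b / a
```

The grid check is only evidence, not a proof; the inequality itself follows from convexity of $t \mapsto e^{-t} + t^2/e$ together with its values at the endpoints $t = 0$ and $t = 1$.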

D PROOFS FOR (MINI-BATCH) ALTERNATING SGDA-RR: FOCUSING ON CHANGES IN THE PROOF

In this appendix, we prove the same convergence rates for altSGDA-RR as the simultaneous update counterpart. Since most of the steps in the proof are similar to those in Appendix C, we only describe which steps change in the proof.

D.1 EPOCH-WISE REPRESENTATIONS AND BOUNDING NOISE TERMS

To analyze altSGDA-RR, we modify the notation for epoch-wise updates. The only change is that an update $y_t^k \to y_{t+1}^k$ uses $x_{t+1}^k$ instead of $x_t^k$. Hence, the definition of $h^k$ should be modified. Recall that
$$g_t^k(z) := \frac{1}{b}\sum_{i \in B_t^k}\nabla_1 f_i(z), \qquad h_t^k(z) := \frac{1}{b}\sum_{i \in B_t^k}\nabla_2 f_i(z),$$
where $B_t^k$ is a mini-batch of size $b$ formed at iteration $t$ of epoch $k$. Then, at epoch $k$, with the re-definition of $h^k$,
$$g^k := \frac{1}{q}\sum_{t=1}^{q} g_t^k(x_{t-1}^k; y_{t-1}^k), \qquad h^k := \frac{1}{q}\sum_{t=1}^{q} h_t^k(x_t^k; y_{t-1}^k),$$
$$x_0^{k+1} = x_0^k - q\alpha g^k, \qquad y_0^{k+1} = y_0^k + q\beta h^k. \tag{altSGDA-RR}$$
We still approximate this epoch-wise update rule by a full-batch simultaneous GDA update (≈simGDA) with step sizes $q\alpha$ and $q\beta$. Again, we control the "noise" terms $\|g^k - \nabla_1 f(z_0^k)\|^2$ and $\|h^k - \nabla_2 f(z_0^k)\|^2$ so that they are not large. Because of the modification of $h^k$, we obtain a different bound for $\|h^k - \nabla_2 f(z_0^k)\|^2$, as follows.

Lemma 24. For mini-batch altSGDA-RR, recall that $G_k := \frac{1}{q}\sum_{t=1}^{q}\|z_{t-1}^k - z_0^k\|^2$. If Assumption 1 holds, then
$$\|h^k - \nabla_2 f(z_0^k)\|^2 \le L^2 G_k + L^2 q\alpha^2\|g^k\|^2, \quad \text{whereas} \quad \|g^k - \nabla_1 f(z_0^k)\|^2 \le L^2 G_k. \tag{25}$$

Proof. Because of the $L$-Lipschitz continuity of $h_t^k(\cdot; \cdot)$,
$$\begin{aligned}
\|h^k - \nabla_2 f(z_0^k)\|^2 &= \Big\|\frac{1}{q}\sum_{t=1}^{q}\big(h_t^k(x_t^k; y_{t-1}^k) - h_t^k(x_0^k; y_0^k)\big)\Big\|^2 \le \frac{1}{q}\sum_{t=1}^{q}\|h_t^k(x_t^k; y_{t-1}^k) - h_t^k(x_0^k; y_0^k)\|^2 \\
&\le \frac{L^2}{q}\sum_{t=1}^{q}\|z_{t-1}^k - z_0^k\|^2 + \frac{L^2}{q}\|x_q^k - x_0^k\|^2 = L^2 G_k + L^2 q\alpha^2\|g^k\|^2.
\end{aligned}$$
The last equality holds because $x_q^k = x_0^{k+1} = x_0^k - q\alpha g^k$.
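The bound of Lemma 24 can be checked on a toy problem by running one epoch of altSGDA-RR and comparing both sides. The sketch below assumes mini-batch size $b = 1$ (so $q = n$) and scalar quadratic components $f_i(x; y) = \frac{a_i}{2}x^2 + b_i xy - \frac{c_i}{2}y^2$, which are not the paper's objectives; the smoothness constant is taken as the largest Frobenius norm of a component Hessian, which upper-bounds the spectral norm. All names and numbers are hypothetical.

```python
import math
import random


def alt_epoch_check(coeffs, x0, y0, alpha, beta, perm):
    """Run one altSGDA-RR epoch (b = 1) and return (lhs, rhs) of Lemma 24."""
    q = len(perm)
    L = max(math.sqrt(a * a + 2 * b * b + c * c) for a, b, c in coeffs)
    x, y = x0, y0
    G = g_sum = h_sum = 0.0
    for i in perm:
        a, b, c = coeffs[i]
        G += ((x - x0) ** 2 + (y - y0) ** 2) / q   # ||z_{t-1} - z_0||^2 / q
        gx = a * x + b * y                          # grad_1 f_i at (x_{t-1}, y_{t-1})
        x_new = x - alpha * gx                      # descent step on x
        hy = b * x_new - c * y                      # grad_2 f_i at (x_t, y_{t-1})
        y = y + beta * hy                           # ascent step uses the new x
        x = x_new
        g_sum += gx
        h_sum += hy
    g_bar, h_bar = g_sum / q, h_sum / q
    full_grad2 = sum(b * x0 - c * y0 for _, b, c in coeffs) / len(coeffs)
    lhs = (h_bar - full_grad2) ** 2
    rhs = L**2 * G + L**2 * q * alpha**2 * g_bar**2
    return lhs, rhs


rng = random.Random(0)
coeffs = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(0.1, 1)) for _ in range(6)]
perm = list(range(len(coeffs)))
rng.shuffle(perm)                                   # random reshuffling of components
lhs, rhs = alt_epoch_check(coeffs, 1.0, -2.0, 0.05, 0.1, perm)
assert lhs <= rhs * (1 + 1e-9) + 1e-12
```

The extra term $L^2 q\alpha^2\|g^k\|^2$ on the right-hand side accounts for the fact that the ascent step at iteration $t$ already sees $x_t^k$ rather than $x_{t-1}^k$.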

D.2 BOUNDING NOISE TERMS: A BIT DIFFERENT PROOF OF LEMMA 18

The same result as Lemma 18 holds not only for simultaneous updates but also for alternating updates, although this is not entirely straightforward. We need to reflect the changes from the previous subsection; that is, we have to be careful when we expand the term $\|y_t^k - y_0^k\|^2$ ($0 \le t \le q-1$). Unlike the derivation of the inequality (12) (in the original proof), we have
$$\begin{aligned}
\|y_t^k - y_0^k\|^2 &= \beta^2 t^2\Big\|\frac{1}{t}\sum_{j=1}^{t} h_j^k(x_j^k; y_{j-1}^k)\Big\|^2 \\
&\le 3\beta^2 t^2\Big\{\Big\|\frac{1}{t}\sum_{j=1}^{t}\big(h_j^k(x_j^k; y_{j-1}^k) - h_j^k(z_0^k)\big)\Big\|^2 + \Big\|\frac{1}{t}\sum_{j=1}^{t} h_j^k(z_0^k) - \nabla_2 f(z_0^k)\Big\|^2 + \|\nabla_2 f(z_0^k)\|^2\Big\} \\
&\le 3\beta^2 t^2\Big\{\frac{1}{t}\sum_{j=1}^{t}\|h_j^k(x_j^k; y_{j-1}^k) - h_j^k(z_0^k)\|^2 + \Big\|\frac{1}{t}\sum_{j=1}^{t} h_j^k(z_0^k) - \nabla_2 f(z_0^k)\Big\|^2 + \|\nabla_2 f(z_0^k)\|^2\Big\} \\
&\le 3\beta^2 t^2\Big\{\frac{L^2}{t}\Big(\|x_t^k - x_0^k\|^2 + \sum_{j=1}^{t}\|z_{j-1}^k - z_0^k\|^2\Big) + \Big\|\frac{1}{t}\sum_{j=1}^{t} h_j^k(z_0^k) - \nabla_2 f(z_0^k)\Big\|^2 + \|\nabla_2 f(z_0^k)\|^2\Big\} \\
&\le 3\beta^2 L^2 t\sum_{j=1}^{t}\|z_j^k - z_0^k\|^2 + 3\beta^2 t^2\Big\{\Big\|\frac{1}{t}\sum_{j=1}^{t} h_j^k(z_0^k) - \nabla_2 f(z_0^k)\Big\|^2 + \|\nabla_2 f(z_0^k)\|^2\Big\} \\
&\le 3\beta^2 L^2 t\cdot qG_k + 3\beta^2 t^2\Big\{\Big\|\frac{1}{t}\sum_{j=1}^{t} h_j^k(z_0^k) - \nabla_2 f(z_0^k)\Big\|^2 + \|\nabla_2 f(z_0^k)\|^2\Big\}.
\end{aligned}$$
The second and third inequalities hold by Jensen's inequality, and the last inequality holds because $t \le q-1$. The resulting upper bound is identical to the inequality (13). Proving the inequality above suffices to show that the conclusion of Lemma 18 also holds for altSGDA-RR, because we eventually take an average over $0 \le t \le q-1$ and the other steps in the proof do not use the "order" (either simultaneous or alternating) of updates.

D.3 RECURRENCE INEQUALITIES FOR GENERAL SMOOTH NONCONVEX-PŁ OBJECTIVE

In the proof for simSGDA-RR, we applied Lemma 16, Lemma 18, and the "small-step-size" assumptions (the three inequalities in (17)) to deduce Lemma 19. However, due to Lemma 24 that we obtained for altSGDA-RR, we need slightly different assumptions on step sizes than (17). Fortunately, Lemma 16 also holds for altSGDA-RR, with the modified version of $h^k$. This is because the proof of the lemma does not use step-wise updates, while the discrepancy between simultaneous and alternating updates only appears in the step-wise updates. Thus, we have the same result as Lemma 19.

Lemma 25. Suppose that Assumptions 1, 2, 3, and 4 hold. Modify the inequalities (17) (from Lemma 19) to
$$\lambda - \{(\lambda+1)(\kappa_2+1)+1\}Lq\alpha - L^2 q\alpha\beta \ge 0, \qquad \beta \le \frac{1}{qL}, \qquad \alpha^2+\beta^2 \le \frac{1}{3q(q-1)L^2}. \tag{26}$$
(In fact, only the first one is different.) Then, the result of Lemma 19 still holds for mini-batch altSGDA-RR.

Proof. We first apply Lemma 24 to the general bound resulting from Lemma 16:
$$\begin{aligned}
V_\lambda(z_0^{k+1}) - V_\lambda(z_0^k) &\le -\frac{\lambda+1}{2}q\alpha\|\nabla\Phi(x_0^k)\|^2 + (\lambda+1)q\alpha\|\nabla\Phi(x_0^k) - \nabla_1 f(z_0^k)\|^2 + \frac{q\alpha}{2}\|\nabla_1 f(z_0^k)\|^2 - \frac{q\beta}{2}\|\nabla_2 f(z_0^k)\|^2 \\
&\quad + \frac{(2\lambda+1)\alpha+\beta}{2}qL^2 G_k - \Big(\lambda - \{(\lambda+1)(\kappa_2+1)+1\}Lq\alpha - L^2 q\alpha\beta\Big)\frac{q\alpha}{2}\|g^k\|^2 - (1 - Lq\beta)\frac{q\beta}{2}\|h^k\|^2.
\end{aligned} \tag{27}$$
Hence, the first two inequalities of (26) eliminate the last two terms on the right-hand side of the inequality (27) above:
$$\begin{aligned}
V_\lambda(z_0^{k+1}) - V_\lambda(z_0^k) &\le -\frac{\lambda+1}{2}q\alpha\|\nabla\Phi(x_0^k)\|^2 + (\lambda+1)q\alpha\|\nabla\Phi(x_0^k) - \nabla_1 f(z_0^k)\|^2 \\
&\quad + \frac{q\alpha}{2}\|\nabla_1 f(z_0^k)\|^2 - \frac{q\beta}{2}\|\nabla_2 f(z_0^k)\|^2 + \frac{(2\lambda+1)\alpha+\beta}{2}qL^2 G_k.
\end{aligned}$$
This is identical to the inequality (18) in the proof of Lemma 19. From this point on, the rest of the proof is identical to that of Lemma 19.

Lemma 25 establishes that altSGDA-RR also satisfies a concise bound on the expected per-epoch change of $V_\lambda$, albeit under a slightly different set of assumptions (26) on step sizes. Using this result, we can prove convergence rates for altSGDA-RR that are exactly the same as those for simSGDA-RR.

D.4 SMALL STEP SIZE ASSUMPTIONS

It remains to show an altSGDA-RR counterpart of Lemma 20, which establishes the general recurrence inequality ( ). In fact, the same choice of step sizes as for simSGDA-RR, namely
$$0 < \beta \le \frac{1}{6L\sqrt{q^2 + \frac{q(q-1)}{n-1}A}} \quad \text{and} \quad \alpha = \frac{\beta}{r} \ \text{ where } r \ge 14\kappa_2^2,$$
meets the newly introduced conditions (26). Among the three inequalities, the only one that needs to be checked is
$$\lambda - \{(\lambda+1)(\kappa_2+1)+1\}Lq\alpha - L^2 q\alpha\beta > 0.$$
Note that, regardless of $A \ge 0$,
$$\beta \le \frac{1}{6Lq} \quad \text{and} \quad \alpha \le \frac{1}{6Lqr} \le \frac{1}{84L\kappa_2^2 q}.$$
In this case, with $\lambda = 4$ and $\kappa_2 \ge 1$,
$$\lambda - \{(\lambda+1)(\kappa_2+1)+1\}Lq\alpha - L^2 q\alpha\beta \ge 4 - (11\kappa_2 + L\beta)Lq\alpha \ge 4 - \left(11\kappa_2 + \frac{1}{6}\right)\cdot\frac{1}{84\kappa_2^2} > 0.$$
Therefore, there is no need to modify our choices of $\lambda$ and the step sizes $\alpha, \beta$ for the analysis of altSGDA-RR, and the rest of the proof for simSGDA-RR goes through.
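The step-size check above is mechanical, so it can be scripted. The sketch below evaluates the three inequalities (26) for the largest $\beta$ allowed when $A = 0$ (namely $\beta = 1/(6Lq)$) and $\alpha = \beta/r$ with $r = 14\kappa_2^2$; the helper names and the grid of test values are illustrative assumptions.

```python
def alt_step_size_conditions(alpha, beta, L, kappa2, q, lam=4.0):
    """Check the three inequalities (26) for mini-batch altSGDA-RR."""
    slack = lam - ((lam + 1) * (kappa2 + 1) + 1) * L * q * alpha - L**2 * q * alpha * beta
    first = slack >= 0
    second = beta <= 1 / (q * L)
    # For q = 1 the third bound is vacuous (its right-hand side is infinite).
    third = q == 1 or alpha**2 + beta**2 <= 1 / (3 * q * (q - 1) * L**2)
    return first and second and third


def boundary_step_sizes(L, kappa2, q):
    """beta = 1/(6 L q) (the A = 0 case) and alpha = beta / (14 kappa2^2)."""
    beta = 1 / (6 * L * q)
    return beta / (14 * kappa2**2), beta


for L in (0.5, 7.0):
    for kappa2 in (1.0, 3.0, 50.0):
        for q in (1, 4, 100):
            alpha, beta = boundary_step_sizes(L, kappa2, q)
            assert alt_step_size_conditions(alpha, beta, L, kappa2, q)
```

Any $A > 0$ or larger $r$ only shrinks the step sizes further, so the boundary case covered here is the most demanding one.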

E PROOFS FOR LOWER BOUND OF DETERMINISTIC FULL-BATCH SIMGDA

In this appendix, we present a comprehensive lower bound for full-batch GDA, which is specific to the choice of step size ratio (Theorem 3). Before we start the proof, we define a class of smooth strongly-convex-strongly-concave functions.

Definition 4. Let $\mathcal{F}(L, \mu_1, \mu_2)$ be the class of functions $f(x; y)$ with two arguments $x$ and $y$ of any dimension that are $L$-smooth, $\mu_1$-strongly-convex in $x$, and $\mu_2$-strongly-concave in $y$. Let $\kappa_1 = L/\mu_1 \ge 1$ and $\kappa_2 = L/\mu_2 \ge 1$ be the condition numbers of the function class. Denote the (unique) saddle (or, global minimax) point by $z^* = (x^*; y^*)$.

We restate and prove Theorem 3 for the reader's convenience.

Theorem 26 (Restatement of Theorem 3). Suppose $\kappa_1 \ge c$ and $\kappa_2 \ge c$ for some constant $c > 1$. Then, for each step size ratio $r > 0$, there exists a function $f \in \mathcal{F}(L, \mu_1, \mu_2)$ for which simGDA with any step sizes $\alpha$ and $\beta$ of ratio $r = \beta/\alpha$ requires
$$K = \begin{cases}
\Omega\big(\kappa_1 r\log(1/\varepsilon)\big), & \text{if } r \ge \kappa_2/c, \\
\Omega\big(\kappa_1\kappa_2\log(1/\varepsilon)\big), & \text{if } c/\kappa_1 \le r \le \kappa_2/c, \\
\Omega\big((\kappa_2/r)\log(1/\varepsilon)\big), & \text{if } 0 < r \le c/\kappa_1
\end{cases}$$
iterations to achieve either $\|z_K - z^*\|^2 \le \varepsilon^2$ or $V_\lambda(z_K) \le \varepsilon^2$.

Proof. The proof proceeds case by case, constructing a worst-case function for each of 4 different regimes of the step size ratio $r$: (1) $\mu_1/\mu_2 \le r \le \kappa_2/c$, (2) $c/\kappa_1 \le r \le \mu_1/\mu_2$, (3) $r \ge \kappa_2/c$, and (4) $0 < r \le c/\kappa_1$. Readers might notice the similarities of the proofs for (1)↔(2) and (3)↔(4).

Case 1. ($\mu_1/\mu_2 \le r \le \kappa_2/c$). Consider
$$f^{(1)}(v, x; y) := \frac{\mu_1}{2}v^2 + \frac{r\mu_2}{2}x^2 - \frac{\mu_2}{2}y^2 + \ell xy, \quad \text{where } \ell^2 = L^2 - r\mu_2^2 - L\mu_2|r-1| \ge 0.$$
Applying Proposition 28, it can be shown that $f^{(1)} \in \mathcal{F}(L, \mu_1, \mu_2)$. Also, $z^* = (0, 0; 0)$ is its unique saddle point. Note that GDA on $f^{(1)}$ can be written as
$$v_{t+1} = \left(1 - \frac{\beta\mu_1}{r}\right)v_t, \qquad \begin{bmatrix} x_{t+1} \\ y_{t+1} \end{bmatrix} = \underbrace{\begin{bmatrix} 1 - \beta\mu_2 & -\beta\ell/r \\ \beta\ell & 1 - \beta\mu_2 \end{bmatrix}}_{\mathbf{A}}\begin{bmatrix} x_t \\ y_t \end{bmatrix} = \mathbf{A}\begin{bmatrix} x_t \\ y_t \end{bmatrix}.$$
Also, the eigenvalues $\tau$ of $\mathbf{A}$ are
$$\tau = 1 - \beta\mu_2 \pm \sqrt{(1 - \beta\mu_2)^2 - \big((1 - \beta\mu_2)^2 + \beta^2\ell^2/r\big)} = 1 - \beta\mu_2 \pm \frac{\beta\ell}{\sqrt{r}}\sqrt{-1}.$$
The spectral radius (i.e., maximum absolute eigenvalue) is $\rho(\mathbf{A}) = \sqrt{(1 - \beta\mu_2)^2 + \beta^2\ell^2/r}$.
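Since the eigenvalues are complex conjugates of each other (the magnitudes are the same), both eigenvalues have magnitude $\rho(\mathbf{A})$. Then, by Proposition 27, $\rho(\mathbf{A}) < 1$ is necessary for convergence. To this end, we need $\beta > 0$ satisfying $\beta < 2\mu_2 r/(r\mu_2^2 + \ell^2)$. To guarantee $\|(v_k, x_k; y_k) - (0, 0; 0)\|^2 \le \varepsilon^2$, we need a large enough $k$ to have $v_k^2 \le O(\varepsilon^2)$. Such a $k$ is required to be at least $\Omega\big(\frac{r}{\beta\mu_1}\log(1/\varepsilon)\big)$. Now note that, since $\mu_1/\mu_2 \le r \le \kappa_2/c$ and $\kappa_2 \ge c$,
$$\frac{1}{\beta} > \frac{r\mu_2^2 + \ell^2}{2\mu_2 r} = \frac{L^2 - L\mu_2|r-1|}{2\mu_2 r} = \frac{L^2}{2\mu_2 r}\left(1 - \frac{|r-1|}{\kappa_2}\right) \ge \frac{L^2}{2\mu_2 r}\left(1 - \frac{1}{c}\right).$$
The last inequality follows by minimizing $1 - \frac{|r-1|}{\kappa_2}$ over $r \in [\mu_1/\mu_2, \kappa_2/c]$. If $r \ge 1$, the term is smaller when $r$ is larger: by taking $r = \kappa_2/c$, we have $1 - \frac{\kappa_2/c - 1}{\kappa_2} \ge 1 - \frac{1}{c}$. Otherwise ($r < 1$), which is possible only when $\mu_1 < \mu_2$, the term is smaller when $r$ is smaller: by taking $r = \mu_1/\mu_2$, we have $1 + \frac{\mu_1/\mu_2 - 1}{\kappa_2} = 1 + \frac{\mu_1 - \mu_2}{L} \ge 1 - \frac{1}{\kappa_2} \ge 1 - \frac{1}{c}$. Thus, we need $\Omega\big(\frac{L^2}{\mu_1\mu_2}\log(1/\varepsilon)\big)$ iterations.

Case 2. ($c/\kappa_1 \le r \le \mu_1/\mu_2$). Consider
$$f^{(2)}(x; y, w) := \frac{\mu_1}{2}x^2 - \frac{\mu_1}{2r}y^2 + \tilde{\ell}xy - \frac{\mu_2}{2}w^2, \quad \text{where } \tilde{\ell}^2 = L^2 - \mu_1^2/r - L\mu_1|1 - 1/r| \ge 0.$$
Applying Proposition 28, it can be shown that $f^{(2)} \in \mathcal{F}(L, \mu_1, \mu_2)$, and $z^* = (0; 0, 0)$ is its unique saddle point. Note that GDA on $f^{(2)}$ can be written as
$$\begin{bmatrix} x_{t+1} \\ y_{t+1} \end{bmatrix} = \underbrace{\begin{bmatrix} 1 - \beta\mu_1/r & -\beta\tilde{\ell}/r \\ \beta\tilde{\ell} & 1 - \beta\mu_1/r \end{bmatrix}}_{\mathbf{B}}\begin{bmatrix} x_t \\ y_t \end{bmatrix} = \mathbf{B}\begin{bmatrix} x_t \\ y_t \end{bmatrix}, \qquad w_{t+1} = (1 - \beta\mu_2)\,w_t.$$
Also, the eigenvalues $\tau$ of $\mathbf{B}$ are
$$\tau = 1 - \beta\mu_1/r \pm \sqrt{(1 - \beta\mu_1/r)^2 - \big((1 - \beta\mu_1/r)^2 + \beta^2\tilde{\ell}^2/r\big)} = 1 - \frac{\beta\mu_1}{r} \pm \frac{\beta\tilde{\ell}}{\sqrt{r}}\sqrt{-1}.$$
The spectral radius is $\rho(\mathbf{B}) = \sqrt{(1 - \beta\mu_1/r)^2 + \beta^2\tilde{\ell}^2/r}$. Since the eigenvalues are complex conjugates of each other (the magnitudes are the same), both eigenvalues have magnitude $\rho(\mathbf{B})$. Then, by Proposition 27, $\rho(\mathbf{B}) < 1$ is necessary for convergence. To this end, we need $\beta > 0$ satisfying $\beta < 2\mu_1/(\mu_1^2/r + \tilde{\ell}^2)$. To guarantee $\|(x_k; y_k, w_k) - (0; 0, 0)\|^2 \le \varepsilon^2$, we need a large enough $k$ to have $w_k^2 \le O(\varepsilon^2)$.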
Such a $k$ is required to be at least $\Omega\big(\frac{1}{\beta\mu_2}\log(1/\varepsilon)\big)$. Now note that, since $c/\kappa_1 \le r \le \mu_1/\mu_2$ and $\kappa_1 \ge c$,
$$\frac{1}{\beta} > \frac{\mu_1^2/r + \tilde{\ell}^2}{2\mu_1} = \frac{L^2 - L\mu_1|1 - 1/r|}{2\mu_1} = \frac{L^2}{2\mu_1}\left(1 - \frac{|1 - 1/r|}{\kappa_1}\right) \ge \frac{L^2}{2\mu_1}\left(1 - \frac{1}{c}\right).$$
The last inequality follows by minimizing $1 - \frac{|1 - 1/r|}{\kappa_1}$ over $r \in [c/\kappa_1, \mu_1/\mu_2]$. If $1 > 1/r$, which is possible only when $\mu_1 > \mu_2$, the term is smaller when $r$ is larger: by taking $r = \mu_1/\mu_2$, we have $1 - \frac{1 - \mu_2/\mu_1}{\kappa_1} = 1 - \frac{\mu_1 - \mu_2}{L} \ge 1 - \frac{1}{\kappa_1} \ge 1 - \frac{1}{c}$. Otherwise ($1 < 1/r$), the term is smaller when $r$ is smaller: by taking $r = c/\kappa_1$, we have $1 + \frac{1 - \kappa_1/c}{\kappa_1} \ge 1 - \frac{1}{c}$. Thus, we need $\Omega\big(\frac{L^2}{\mu_1\mu_2}\log(1/\varepsilon)\big)$ iterations.

This is because the characteristic polynomial of $HH^\top$, or $\det(\omega I - HH^\top)$, is a quadratic polynomial of $\omega$, and its maximum root should not be greater than $L^2$. Let $\ell^2 = L^2 - \mu_1\mu_2 + a$ for some $a \in \mathbb{R}$. Plugging this $\ell^2$ into both inequalities above, we get $a^2 - (\mu_1 - \mu_2)^2 L^2 \ge 0$ and $a \le -(\mu_1 - \mu_2)^2/2$, respectively. One can check that $a = -L|\mu_1 - \mu_2|$ is the largest possible $a$ satisfying both inequalities above. This proves the proposition.

Subsequently, we show that if our convergence rate is exponential, then the iteration complexity in terms of $\|z - z^*\|^2$ is equivalent to that in terms of $V_\lambda(z) = \lambda[\Phi(x) - \Phi^*] + [\Phi(x) - f(z)]$ for PŁ(Φ)-PŁ problems, up to constant factors. This also applies to the function class $\mathcal{F}(L, \mu_1, \mu_2)$ since it is a subclass of smooth PŁ(Φ)-PŁ functions (∵ Propositions 7 and 10).

Lemma 29. Suppose $f(x; y)$ is an $L$-smooth function satisfying the $y$-side $\mu_2$-PŁ condition and the primal $\mu_1$-PŁ condition (i.e., PŁ(Φ)-PŁ). Suppose $z^* = (x^*; y^*)$ is a global minimax point of $f$. Then, it satisfies
$$\frac{\lambda\mu_1\mu_2^2}{2(\lambda\mu_1\mu_2 + 2L^2)}\|z - z^*\|^2 \le V_\lambda(z) \le \frac{(\lambda+1)L^3}{\mu_2^2}\|z - z^*\|^2.$$
We remark that the second inequality also holds for general smooth nonconvex-PŁ problems.

Proof. Let $\kappa_1 = L/\mu_1$ and $\kappa_2 = L/\mu_2$ be condition numbers.
By the conditions on $f$ (smoothness and PŁ conditions), for any $x$ and $y$,
$$\frac{\mu_1}{2}\|x - x^*\|^2 \overset{\text{Prop. 10}}{\le} \Phi(x) - \Phi^* \overset{\text{Prop. 9}}{\le} \frac{L(\kappa_2+1)}{2}\|x - x^*\|^2, \qquad \frac{\mu_2}{2}\|y - y^*(x)\|^2 \overset{\text{Ass. 4}}{\le} \Phi(x) - f(x; y) \overset{\text{Ass. 1}}{\le} \frac{L}{2}\|y - y^*(x)\|^2,$$
where $y^*(x)$ is a projection of $y$ onto $\arg\max_{y'} f(x; y')$. In particular, $y^*(x^*) = y^*$. Since $y^*(x)$ is a function of $x$ and can differ from $y^*$, we need to bound the term $\|y - y^*(x)\|^2$ using $\|x - x^*\|^2$ and $\|y - y^*\|^2$. To upper-bound the term $\|y - y^*(x)\|^2$, note that
$$\|y - y^*(x)\|^2 \le \big(\|y - y^*(x^*)\| + \|y^*(x) - y^*(x^*)\|\big)^2 \le \big(\|y - y^*\| + \kappa_2\|x - x^*\|\big)^2 \le (1 + \kappa_2^2)\big(\|y - y^*\|^2 + \|x - x^*\|^2\big).$$
The first inequality holds by the triangle inequality, the second by Proposition 8, and the last by the Cauchy-Schwarz inequality. To lower-bound in a similar way, note that for any constant $a > 0$,
$$\|y - y^*\|^2 \le \big(\|y - y^*(x)\| + \|y^*(x) - y^*(x^*)\|\big)^2 \le \Big(\|y - y^*(x)\| + \frac{\kappa_2}{\sqrt{a}}\cdot\sqrt{a}\,\|x - x^*\|\Big)^2 \le \Big(1 + \frac{\kappa_2^2}{a}\Big)\big(\|y - y^*(x)\|^2 + a\|x - x^*\|^2\big).$$
$$\therefore\ \|y - y^*(x)\|^2 \ge \frac{1}{1 + \kappa_2^2/a}\|y - y^*\|^2 - a\|x - x^*\|^2.$$
Now we can prove the inequalities in the lemma. We first show the second one. Applying $\kappa_2 \ge 1$ multiple times,
$$\begin{aligned}
V_\lambda(x; y) &= \lambda[\Phi(x) - \Phi^*] + [\Phi(x) - f(z)] \le \frac{\lambda L(\kappa_2+1)}{2}\|x - x^*\|^2 + \frac{L}{2}\|y - y^*(x)\|^2 \\
&\le \left(\frac{\lambda L(\kappa_2+1)}{2} + \frac{L(1+\kappa_2^2)}{2}\right)\|x - x^*\|^2 + \frac{L(1+\kappa_2^2)}{2}\|y - y^*\|^2 \\
&\le (\lambda+1)L\kappa_2^2\big(\|x - x^*\|^2 + \|y - y^*\|^2\big) = \frac{(\lambda+1)L^3}{\mu_2^2}\|z - z^*\|^2.
\end{aligned}$$
To show the first inequality of the lemma, let $a = \frac{\lambda\mu_1}{2\mu_2}$. Then
$$\begin{aligned}
V_\lambda(x; y) &\ge \frac{\lambda\mu_1}{2}\|x - x^*\|^2 + \frac{\mu_2}{2}\|y - y^*(x)\|^2 \ge \left(\frac{\lambda\mu_1}{2} - \frac{\mu_2 a}{2}\right)\|x - x^*\|^2 + \frac{\mu_2}{2(1 + \kappa_2^2/a)}\|y - y^*\|^2 \\
&\ge \frac{\lambda\mu_1}{4}\|x - x^*\|^2 + \frac{\lambda\mu_1}{4(a + \kappa_2^2)}\|y - y^*\|^2 \ge \frac{\lambda\mu_1}{4(a + \kappa_2^2)}\big(\|x - x^*\|^2 + \|y - y^*\|^2\big) = \frac{\lambda\mu_1\mu_2^2}{2(\lambda\mu_1\mu_2 + 2L^2)}\|z - z^*\|^2.
\end{aligned}$$
This concludes the proof.

The equivalence of iteration complexities for achieving $\|z_K - z^*\|^2 \le \varepsilon^2$ or $V_\lambda(z_K) \le \varepsilon^2$ follows directly from this lemma, as long as the convergence speed is exponential. For example, suppose we have an upper convergence bound $\|z_K - z^*\|^2 \le a\exp(-K/r)$ for some constants $a, r > 0$.
This implies an upper iteration complexity bound $K = O(r\log(1/\varepsilon))$ sufficient to achieve $\|z_K - z^*\|^2 \le \varepsilon^2$. Then by Lemma 29, we also have $V_\lambda(z_K) \le a'\exp(-K/r)$, where $a' = a(\lambda+1)L^3/\mu_2^2$ is also a constant. This implies an upper iteration complexity bound $K = O(r\log(1/\varepsilon))$ as well, sufficient to achieve $V_\lambda(z_K) \le \varepsilon^2$. The translation in the other direction follows by similar logic.
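The Case 1 construction of Theorem 26 can be explored numerically. The sketch below builds the coupling coefficient of $f^{(1)}$, evaluates the spectral radius of the $(x, y)$ iteration, and runs simGDA with step sizes just inside and just outside the stability threshold $\beta < 2\mu_2 r/(r\mu_2^2 + \ell^2)$. The symbol $\ell$ for the coupling coefficient, the function names, and the numeric parameters ($L = 10$, $\mu_1 = \mu_2 = 1$, $r = 2$) are assumptions made for illustration.

```python
import math


def coupling_sq(L, mu2, r):
    """ell^2 = L^2 - r mu2^2 - L mu2 |r - 1| for the Case 1 function."""
    return L**2 - r * mu2**2 - L * mu2 * abs(r - 1)


def spectral_radius(beta, mu2, ell_sq, r):
    """sqrt((1 - beta mu2)^2 + beta^2 ell^2 / r): modulus of both eigenvalues."""
    return math.sqrt((1 - beta * mu2) ** 2 + beta**2 * ell_sq / r)


def max_stable_beta(mu2, ell_sq, r):
    """Threshold on beta below which the (x, y) iteration contracts."""
    return 2 * mu2 * r / (r * mu2**2 + ell_sq)


def run_sim_gda(beta, r, mu1, mu2, ell, steps, v=1.0, x=1.0, y=1.0):
    """Simultaneous GDA on f1(v, x; y) with alpha = beta / r."""
    alpha = beta / r
    for _ in range(steps):
        grad_v = mu1 * v
        grad_x = r * mu2 * x + ell * y
        grad_y = ell * x - mu2 * y
        v, x, y = v - alpha * grad_v, x - alpha * grad_x, y + beta * grad_y
    return v, x, y


L, mu1, mu2, r = 10.0, 1.0, 1.0, 2.0          # hypothetical problem constants
ell_sq = coupling_sq(L, mu2, r)
beta_max = max_stable_beta(mu2, ell_sq, r)
assert spectral_radius(0.9 * beta_max, mu2, ell_sq, r) < 1 < spectral_radius(1.1 * beta_max, mu2, ell_sq, r)

_, x_s, y_s = run_sim_gda(0.5 * beta_max, r, mu1, mu2, math.sqrt(ell_sq), 3000)
_, x_u, y_u = run_sim_gda(1.1 * beta_max, r, mu1, mu2, math.sqrt(ell_sq), 3000)
assert math.hypot(x_s, y_s) < 1e-6 and math.hypot(x_u, y_u) > 1e3
```

Because the threshold forces $\beta = O(\mu_2 r/L^2)$, the decoupled coordinate $v$ contracts by only $1 - \beta\mu_1/r$ per step, which is where the $\Omega(\kappa_1\kappa_2\log(1/\varepsilon))$ iteration count comes from.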

F REMARK ON SMOOTHNESS ASSUMPTIONS AND LOWER BOUND OF WITH-REPLACEMENT SGD(A)

During the discussion phase of the conference, a reviewer asked whether component smoothness (Assumption 1) is more crucial than without-replacement component sampling for faster convergence. We claim that component smoothness alone is not sufficient for improving the convergence rate of with-replacement SGD(A). To this end, we provide formal lower convergence bounds. For simplicity, we use mini-batches of size 1 throughout this appendix.

First, the theorem below provides a lower bound on with-replacement SGD for minimization problems. Readers can also verify that an analogous lower bound holds for SGD with an unbiased and independently sampled gradient oracle for more general stochastic minimization problems. The proof appears later in this appendix.

Theorem 30. For any step size $\eta > 0$, there exists a real-valued strongly-convex function $f(x)$ defined on $\mathbb{R}^d$ with $f^* := \min_x f(x)$, satisfying:
1. $f$ consists of $n > 1$ smooth component functions $f_i$: $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, where each component $f_i$ is smooth;
2. After running $T > 1$ iterations of with-replacement SGD (with mini-batch size 1) starting from $x_0 \in \mathbb{R}^d$, the last iterate $x_T$ satisfies $\mathbb{E}[f(x_T) - f^*] \ge \Omega(1/T)$, where the expectation is taken with respect to the randomness of the i.i.d. index choice at each iteration.

Next, we show that this theorem naturally induces a convergence lower bound for the minimax counterpart, with-replacement SGDA. Consider a (finite-sum) minimax problem
$$\min_x \max_y\ g(x, y) := f(x) - f(y),$$
where $f = \frac{1}{n}\sum_{i=1}^{n} f_i$ is a worst-case function from the proof of Theorem 30. Here, the minimax problem on $g$ can be solved by minimizing $f$. Moreover, since the primal function $\Phi(x) := \max_y g(x, y)$ associated with $g$ is in fact the same as $f(x) - f^*$, the potential function $V_\lambda(x, y) := \lambda[\Phi(x) - (\min_x \Phi(x))] + [\Phi(x) - g(x, y)]$ becomes $\lambda(f(x) - f^*) + (f(y) - f^*)$ for a constant $\lambda > 0$. Combining these facts, we immediately obtain the following lower convergence bound for with-replacement SGDA.

Corollary 1. There exists a strongly-convex-strongly-concave function $g(x, y) := \frac{1}{n}\sum_{i=1}^{n} g_i(x, y)$ consisting of $n$ smooth component functions $g_i$, for which the last iterate $(x_T, y_T)$ of with-replacement SGDA satisfies $\mathbb{E}[V_\lambda(x_T, y_T)] \ge \Omega(1/T)$.

Corollary 1 formally proves that with-replacement SGDA on strongly-convex-strongly-concave minimax problems with smooth components has a worst-case convergence rate of $\Omega(1/T)$. This matches the $O(1/T)$ upper bound obtained for primal-PŁ-PŁ problems by Yang et al. (2020). Considering that strongly-convex-strongly-concave functions form a strict subset of primal-PŁ-PŁ functions, Corollary 1 establishes that adding the component smoothness assumption does not provide further speed-up for with-replacement SGDA. In contrast, our theoretical result in Theorem 2 shows that SGDA-RR has a much faster convergence rate $\mathbb{E}[V_\lambda] \le \tilde{O}\big(\frac{1}{nK^2}\big)$ for primal-PŁ-PŁ minimax problems, where $K$ is the number of epochs. One can check that our $\tilde{O}\big(\frac{1}{nK^2}\big)$ bound is faster than the tight convergence rate $\Theta(1/T)$ of with-replacement SGDA by plugging in $T = nK$, which gives $\Theta\big(\frac{1}{nK}\big)$. In light of Corollary 1, we can now claim that the improvement can be attributed solely to RR. We do not provide a lower bound for more general nonconvex-PŁ problems here; we believe that this more challenging case is a topic for a separate paper. Nonetheless, we conjecture that the speed-up of SGDA-RR in nonconvex-PŁ settings is also due to the effect of RR, not component smoothness.

We now provide the postponed proof of Theorem 30.

Proof of Theorem 30. We construct worst-case functions from quadratic functions on $\mathbb{R}$, which are clearly $L$-smooth for a fixed constant $L > 0$. It is then easy to extend the argument to functions with higher-dimensional domains. Let $x_0 \in \mathbb{R}$ be the initial iterate.
Case 1 ($\frac{1}{LT} \le \eta \le \frac{2}{L} - \frac{1}{LT}$). Note that this condition on the step size is equivalent to the inequality $(1 - \eta L)^2 \le (1 - 1/T)^2$. We first assume that $n$ is an even number; we handle the case of an odd $n > 1$ a bit later. Consider $f(x) = \frac{L}{2}x^2$ consisting of an even number of components $f_i$ defined by

$$f_i(x) = \begin{cases} \frac{L}{2}x^2 + \nu x, & i \le \frac{n}{2}, \\ \frac{L}{2}x^2 - \nu x, & i \ge \frac{n}{2} + 1, \end{cases}$$

for some number $\nu \in \mathbb{R}$. At each iteration $t \ge 1$, we choose a component index $i(t) \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}([n])$ (with-replacement sampling). Then we can write the chosen component function at iteration $t$ as $f_{i(t)}(x) = \frac{L}{2}x^2 - s_t \nu x$ for i.i.d. random variables $s_t \sim \mathrm{Unif}(\{\pm 1\})$. Accordingly, an SGD step can be written as

$$x_t = x_{t-1} - \eta \nabla f_{i(t)}(x_{t-1}) = (1 - \eta L)x_{t-1} + \eta s_t \nu.$$

Unrolling this recursion, we have

$$x_T = (1 - \eta L)^T x_0 + \eta\nu \sum_{t=1}^{T} (1 - \eta L)^{T-t} s_t.$$

Taking squares and expectations (with respect to the random variables $s_1, \dots, s_T$) on both sides, we have

$$\mathbb{E}[x_T^2] = (1 - \eta L)^{2T} x_0^2 + \eta^2\nu^2 \sum_{t=1}^{T} (1 - \eta L)^{2(T-t)},$$

by the fact that the $s_t$'s are zero-mean independent random variables with absolute value 1:

$$\mathbb{E}[s_t s_{t'}] = \begin{cases} 0, & t \ne t' \ (\because \text{independent}), \\ 1, & t = t' \ (\because s_t^2 = 1). \end{cases}$$

We calculate the sum above as follows: since $(1 - \eta L)^2 \le (1 - 1/T)^2$ and $(1 - 1/T)^T \le e^{-1}$,

$$\sum_{t=1}^{T} (1 - \eta L)^{2(T-t)} = \frac{1 - (1 - \eta L)^{2T}}{1 - (1 - \eta L)^2} \ge \frac{1 - (1 - \frac{1}{T})^{2T}}{2\eta L\left(1 - \frac{\eta L}{2}\right)} \ge \frac{1 - e^{-2}}{2\eta L}.$$

With this inequality, and since $(1 - \eta L)^{2T} x_0^2 \ge 0$, we can lower-bound the expectation $\mathbb{E}[x_T^2]$:

$$\mathbb{E}[x_T^2] \ge \eta^2\nu^2 \cdot \frac{1 - e^{-2}}{2\eta L} = \frac{(1 - e^{-2})\nu^2}{2L}\,\eta \ge \frac{(1 - e^{-2})\nu^2}{2L^2 T}.$$

Since $f$ has a minimum $f^* = 0$ at $x = 0$, we eventually have

$$\mathbb{E}[f(x_T) - f^*] = \frac{L}{2}\mathbb{E}[x_T^2] \ge \frac{(1 - e^{-2})\nu^2}{4LT} = \Omega\left(\frac{\nu^2}{LT}\right).$$

Now we consider the case when the number of components $n > 1$ is odd. Consider $f_n(x) \equiv 0$ and let the remaining $n - 1$ components be the same as in the case above (with an even number of components). Note that the zero component $f_n$ does not affect the trajectory of SGD (i.e., the points visited by SGD) or the optimality of $f$ ($f^* = 0$ at $x = 0$), while the whole objective function becomes $f(x) = \frac{n-1}{n} \cdot \frac{L}{2}x^2$. Thus, it can be easily shown that the $\Omega\left(\frac{\nu^2}{LT}\right)$ lower bound also holds.

Case 2 ($0 < \eta < \frac{1}{LT}$ or $\eta > \frac{2}{L} - \frac{1}{LT}$). From the condition on the step size, we have $(1 - \eta L)^2 > (1 - 1/T)^2$. Consider $f_i(x) = \frac{L}{2}x^2$ for every $i \in [n]$, so that all components are identical. In this case, we show that the last iterate of SGD is bounded below by a constant independent of $T > 1$. At each iteration $t \ge 1$, a step of SGD gives $x_t = (1 - \eta L)x_{t-1}$. Then, applying $T \ge 2$,

$$x_T^2 = (1 - \eta L)^{2T} x_0^2 > \left(1 - \frac{1}{T}\right)^{2T} x_0^2 \ge \left(1 - \frac{1}{2}\right)^{4} x_0^2 = \frac{x_0^2}{16}.$$

Since $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) = \frac{L}{2}x^2$ has a minimum $f^* = 0$ at $x = 0$, we have

$$f(x_T) - f^* > \frac{L x_0^2}{32} = \Omega(1) \cdot L x_0^2.$$
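As a numerical sanity check of Case 1, the following minimal Python sketch evaluates the closed-form expression for $\mathbb{E}[x_T^2]$ and compares it with a Monte Carlo simulation of with-replacement SGD on the components $\frac{L}{2}x^2 \pm \nu x$. The function names and the parameter values ($L = \nu = x_0 = 1$, $T = 50$) are illustrative assumptions and are not taken from the paper's experiments.

```python
import math
import random


def expected_sq_iterate(x0, eta, L, nu, T):
    """Closed form of E[x_T^2] for with-replacement SGD on (L/2) x^2 +/- nu x."""
    rho = 1.0 - eta * L  # contraction factor of a single SGD step
    noise_sum = sum(rho ** (2 * (T - t)) for t in range(1, T + 1))
    return rho ** (2 * T) * x0 ** 2 + eta ** 2 * nu ** 2 * noise_sum


def simulate_sq_iterate(x0, eta, L, nu, T, runs=20000, seed=0):
    """Monte Carlo estimate of E[x_T^2]; s_t is a uniformly random sign."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        x = x0
        for _ in range(T):
            s = rng.choice((-1.0, 1.0))
            x = (1.0 - eta * L) * x + eta * s * nu
        total += x * x
    return total / runs


# Hypothetical parameter values, chosen only for illustration.
L, nu, T, x0 = 1.0, 1.0, 50, 1.0
eta = 1.0 / (L * T)  # smallest step size covered by Case 1
exact = expected_sq_iterate(x0, eta, L, nu, T)
lower = (1.0 - math.exp(-2.0)) * nu ** 2 / (2.0 * L ** 2 * T)
estimate = simulate_sq_iterate(x0, eta, L, nu, T)
print(f"closed form: {exact:.5f}, lower bound: {lower:.5f}, simulated: {estimate:.5f}")
```

The closed form should dominate the lower bound for any step size in the Case 1 range, and the simulated value should agree with the closed form up to sampling error.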

G EXPERIMENTS: QUADRATIC GAMES

In this appendix, we provide a more detailed illustration of our numerical evaluations on the quadratic games introduced in Section 6. Recall that the objective function $f$ and its component functions $f_i$ are given in Equation (5) as

$$f(x; y) = \frac{1}{2}x^\top A x + x^\top B y - \frac{1}{2}y^\top C y, \qquad f_i(x; y) = \frac{1}{2}x^\top A_i x + x^\top B_i y - \frac{1}{2}y^\top C_i y + u_i^\top x - v_i^\top y.$$

We choose the same dimensions for the variables $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$.

G.2 COMPARISON: THE EFFECT OF COMPONENT DISCREPANCY

Notice that the discrepancy between component functions gets larger as $\Delta$ grows. Technically, one can check that the gradient variance (which we controlled in Assumption 2) is proportional to the norms of the vectors $u_i$ and $v_i$. Moreover, we have already discussed that the gap between the convergence speeds of SGDA and SGDA-RR becomes larger when the gradient variance is large. We now present the results of numerical experiments that vary the value of $\Delta$ over 10, 20, and 40, while fixing $L_B = 4$, $\mu_C = 0.4$, $b = 1$, and the other experiment parameters. As shown in Figure 2, the difference between the random-reshuffling algorithm and the uniform-sampling algorithm gets larger as $\Delta$ increases.

G.3 COMPARISON: THE EFFECT OF CONDITION NUMBER

Here, we present the results of experiments that vary the value of $\kappa_2$ over 5, 10, and 20, while fixing $\Delta = 20$, $b = 1$, and the other experiment parameters. To this end, we set the parameters $L_B$ and $\mu_C$ to $(L_B, \mu_C) = (2.5, 0.5)$, $(4, 0.4)$, and $(5, 0.25)$, respectively. The results are shown in Figure 3. We observe that more epochs are required for convergence as $\kappa_2$ increases, regardless of the type of algorithm. One may think that the performance gap between RR-based and non-RR-based algorithms is small when $\kappa_2$ is large. However, when we run the algorithms for an extended number of epochs, we observe a significant gap in convergence speeds.

G.4 COMPARISON: THE EFFECT OF BATCH SIZE

The last comparison is about the effect of batch size $b \in \{1, 25, 50, 100\}$. Recall that we linearly scale the step sizes as the batch size changes. However, since the number of epochs is fixed, the number of iterations decreases as $b$ gets larger. As the reader may notice, the convergence behaviors of SGDA (resp., SGDA-RR) and AGDA (resp., AGDA-RR) are similar in our construction of quadratic games. Thus, in this subsection, we only compare simSGDA and its variants. Instead, we introduce two more methods of component choice other than with-replacement uniform sampling and random reshuffling:

• WORB (WithOut-Replacement mini-Batching): every mini-batch is sampled uniformly at random without replacement, while any pair of mini-batches in an epoch may have some indices in common; this is the same as b-minibatch sampling (Loizou et al., 2021).
• NS (No Shuffle): the indices 1, ..., n are accessed in their predefined order to construct mini-batches; this is without-replacement but deterministic. Remark: for minimization problems, SGD with NS is usually referred to as the incremental gradient (IG) algorithm (Mishchenko et al., 2020).

These two methods are somewhat related to without-replacement component sampling, but both differ from RR, which samples a permutation of $[n]$ uniformly at random every epoch. We refer to simSGDA using mini-batches sampled by WORB and NS as simSGDA-WORB and simSGDA-NS, respectively.

Remarks: If $b = 1$, simSGDA-WORB becomes the same algorithm as vanilla simSGDA. Also, since we choose $n = 100$, if $b = n = 100$, all three algorithms simSGDA-RR/-WORB/-NS become the same as deterministic, full-batch (simultaneous) GDA.

The results are shown in Figure 4. One can notice that the potential plots of simSGDA, simSGDA-RR, and simSGDA-NS each remain the same even when we change the batch size ($b < 100$). Also, if $b > 1$, simSGDA-WORB performs better than vanilla simSGDA. These observations imply that without-replacement mini-batches benefit the convergence speed to some extent in our quadratic game. However, the experimental results also imply that both (i) without-replacement sampling per epoch (i.e., shuffling) and (ii) randomization are indeed essential for fast convergence in our quadratic game experiments. In particular, WORB requires a very large batch size but still converges much more slowly than RR (see Figure 4c, which is the case of using half of the total components at each iteration).
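The four component-selection schemes compared in this subsection can be sketched as index generators for a single epoch. This is a minimal illustration under stated assumptions, not the code used for the experiments: the function names are hypothetical, indices are 0-based, $b$ is assumed to divide $n$, and vanilla mini-batch sampling is assumed to draw every index i.i.d. with replacement.

```python
import random


def batches_with_replacement(n, b, rng):
    """Uniform sampling (assumed vanilla scheme): every index is drawn i.i.d. from range(n)."""
    return [[rng.randrange(n) for _ in range(b)] for _ in range(n // b)]


def batches_rr(n, b, rng):
    """RR: draw one fresh uniform permutation per epoch, then slice it into mini-batches."""
    perm = list(range(n))
    rng.shuffle(perm)
    return [perm[i:i + b] for i in range(0, n, b)]


def batches_worb(n, b, rng):
    """WORB: indices are distinct within a batch, but batches are drawn independently."""
    return [rng.sample(range(n), b) for _ in range(n // b)]


def batches_ns(n, b):
    """NS: the fixed order 0, ..., n-1 sliced into mini-batches (deterministic)."""
    order = list(range(n))
    return [order[i:i + b] for i in range(0, n, b)]


rng = random.Random(0)
for name, batches in [
    ("uniform", batches_with_replacement(8, 4, rng)),
    ("RR", batches_rr(8, 4, rng)),
    ("WORB", batches_worb(8, 4, rng)),
    ("NS", batches_ns(8, 4)),
]:
    print(name, batches)
```

Only RR and NS visit every component exactly once per epoch, and only RR combines this property with fresh randomness, which is the combination the experiments above point to.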

H OMITTED COMPARISON WITH RELATED WORKS

H.1 COMPARISON WITH XIE ET AL. (2021)

To specialize Xie et al. (2021, Theorem 3) to the single-machine setup and discuss their results in terms of our notation, we need to replace their symbols $(T, S, K, \sigma_1^2, \sigma_2^2, G_1^2, G_2^2, L_{12}, L_f, \mu, L_\Phi, L_0, \eta_t, \gamma_t)$



As we noted, Assumption 1 directly implies the average smoothness, which is a common requirement in analyses with unbiased gradient oracles. Nevertheless, we claim that Assumption 1 is not more crucial than without-replacement sampling for obtaining faster convergence rates: see Appendix F for details and proofs.

We say a function $g : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex for some $\mu > 0$ if $g(x') \ge g(x) + \langle \nabla g(x), x' - x \rangle + \frac{\mu}{2}\|x' - x\|^2$ holds for all $x, x'$; we say $g$ is $\mu$-strongly concave if $-g$ is $\mu$-strongly convex.

The PŁ(Φ)-PŁ condition is much weaker than the two-sided PŁ condition, which assumes an "x-side" PŁ condition: see Proposition 10. As pointed out by Guo et al. (2020), there exists a PŁ(Φ)-PŁ function $g(x; y)$ that is not x-side $\mu$-PŁ for any $\mu > 0$ and is even strongly concave in $x$.

Although they consider two-sided PŁ problems, their analysis applies to PŁ(Φ)-PŁ problems as well.

strongly-convex-strongly-concave (SC-SC) ⊂ two-sided PŁ ⊂ PŁ(Φ)-PŁ ⊂ nonconvex-PŁ.

$g(x; y) = -\frac{L}{2}x^2 + Lxy - \frac{\mu}{2}y^2$, where $L/\mu > 1$: its primal function is strongly convex.

Loosely speaking, a differential Stackelberg equilibrium is a stationary point $(x^*; y^*)$ where $f(x^*; \cdot)$ is locally strongly concave near $y^*$ and $\Phi(\cdot)$ is locally strongly convex near $x^*$.

As described in Section 4.2, AGDA-RR uses only a one-sided gradient ($\nabla_1$ or $\nabla_2$) at each iteration; given a fixed budget of gradient computations, it should access components twice as many times as SGDA-RR. Hence, we report the values at every other iteration of AGDA & AGDA-RR, for a fair comparison.

For any $a, b \in \mathbb{R}^d$, $2\langle a, b \rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$.

For any $a, b \in \mathbb{R}^d$, $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$.

$(ax + by)^2 \le (a^2 + b^2)(x^2 + y^2)$ for real numbers $a, b, x, y$.

During and after the discussion phase, we performed some more experiments. When we tried to plot all the results over iterations, the figures in PDF format became too large. Consequently, in this appendix, we only plot the results over epochs to reduce the file size of the figures.



Sample a permutation of $[n]$ from $\mathrm{Unif}(S_n)$ at epoch $k$ (RR: uniformly randomly shuffle the indices every epoch)
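The reshuffling step above, combined with the simultaneous and alternating updates stated in the introduction, can be sketched on a toy scalar problem. This is only an illustration under stated assumptions: the components $f_i(x; y) = \frac{a}{2}x^2 + bxy - \frac{c}{2}y^2 + u_i x - v_i y$ with $\sum_i u_i = \sum_i v_i = 0$, the function name `sgda_rr`, and all default parameter values are hypothetical and are not the paper's experimental setup.

```python
import random


def sgda_rr(u, v, a=1.0, b=1.0, c=1.0, alpha=0.01, beta=0.01,
            epochs=1000, alternating=False, seed=0):
    """simSGDA-RR (alternating=False) or altSGDA-RR (alternating=True) on scalar quadratics."""
    rng = random.Random(seed)
    n = len(u)
    x, y = 1.0, 1.0  # arbitrary initial iterate
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)  # RR: fresh uniform permutation every epoch
        for i in perm:
            grad_x = a * x + b * y + u[i]        # gradient of f_i in x
            x_new = x - alpha * grad_x           # descent step in x
            x_used = x_new if alternating else x  # alt uses the updated x
            grad_y = b * x_used - c * y - v[i]   # gradient of f_i in y
            y = y + beta * grad_y                # ascent step in y
            x = x_new
    return x, y


# Zero-sum offsets, so the averaged objective has its saddle point at (0, 0).
u = [1.0, -1.0, 2.0, -2.0]
v = [0.5, -0.5, 1.0, -1.0]
print("simSGDA-RR:", sgda_rr(u, v))
print("altSGDA-RR:", sgda_rr(u, v, alternating=True))
```

With constant step sizes, both variants end each epoch in a small neighborhood of the saddle point $(0, 0)$ of the averaged objective, since the component offsets cancel to first order over a full pass.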

AGDA vs. AGDA-RR.

Figure 1: Experimental results on quadratic games (5). Solid lines: average across 10 different runs. Shaded regions: 95% confidence intervals (±1.96 std). Dots: start/end of epochs. The vertical axes are on a logarithmic scale.

Summary of our contributions
2 Problem setup
  2.1 Notation
  2.2 Algorithms: simSGDA-RR & altSGDA-RR
  2.3 Assumptions and definitions
3 Main results
  3.1 Potential function $V_\lambda$
  3.2 Main theorems: upper bounds of convergence rates
4 Comparison with related works
  4.1 Comparison with stochastic with-replacement setting
  4.2 Comparison with other works on stochastic without-replacement setting
  4.3 Comparison with deterministic setting
5 Lower bound for (full-batch) simGDA using separate step batch SGDA-RR and convergence rates
B Technical propositions
  B.1 Function classes: PŁ condition, smoothness, and more
  B.2 Without-replacement sampling
  B.3 Basic recurrence inequality
C Proofs for (mini-batch) simultaneous SGDA-RR
  C.1 Warm-up: proof sketch for b = 1
  C.2 Epoch-wise representations and bounding noise terms
  C.3 Recurrence inequalities for general smooth nonconvex-PŁ objective
  C.4 Convergence rates for smooth nonconvex-PŁ problem
  C.5 Convergence rates for smooth primal-PŁ-PŁ problem
D Proofs for (mini-batch) alternating SGDA-RR: focusing on changes in the proof
  D.1 Epoch-wise representations and bounding noise terms
  D.2 Bounding noise terms: a bit different proof of Lemma 18
  D.3 Recurrence inequalities for general smooth nonconvex-PŁ objective
  D.4 Small step size assumptions
E Proofs for lower bound of deterministic full-batch simGDA
F Remark on smoothness assumptions and Lower bound of with-replacement SGD(A)
G Experiments: quadratic games
  G.1 Parameter choices
  G.2 Comparison: the effect of component discrepancy
  G.3 Comparison: the effect of condition number
  G.4 Comparison: the effect of batch size
H Omitted comparison with related works
  H.1 Comparison with Xie et al. (2021)

Published as a conference paper at ICLR 2023

(saddle point ⇒ global minimax & global maximin) This is straightforward from the definitions: for any $x$ and $y$,

$$\min_{x'} g(x'; y) \le g(x^*; y) \le g(x^*; y^*) \le g(x; y^*) \le \max_{y'} g(x; y').$$

$d_y$: we set $d_x = d_y = d$.

G.1 PARAMETER CHOICES

To sample the matrix $C = \frac{1}{n}\sum_{i=1}^{n} C_i \in \mathbb{R}^{d \times d}$ satisfying $\mu_C I_d \preceq C$ and $\|C_i\|_2 \le L_C$, we first randomly generate an orthogonal matrix $Q_C \in \mathbb{R}^{d \times d}$ (i.e., $Q_C^\top Q_C = I_d$), by taking advantage of

Figure 2: Comparisons by changing the value of ∆ ∈ {10, 20, 40}. Solid lines: average across 10 different runs. Shaded regions: 95% confidence intervals (±1.96 std). The vertical axes are on a logarithmic scale.

κ2 = 20, AGDA(-RR)

Figure 3: Comparisons by changing the value of κ 2 = L/µ C ∈ {5, 10, 20}. Solid lines: average across 10 different runs. Shaded regions: 95% confidence intervals (±1.96 std). The vertical axes are on a logarithmic scale. Note: we run 1000 epochs for κ 2 = 20 (see the rightmost column), whereas we run 300 epochs for the other κ 2 ∈ {5, 10} (see the leftmost & middle columns).

Figure 4: Comparisons of simSGDA(-RR,-WORB,-NS) when varying $b \in \{1, 25, 50, 100\}$. Solid lines: average across 10 different runs. Shaded regions: 95% confidence intervals (±1.96 std). The vertical axes are on a logarithmic scale.

Throughout this appendix, we use $\mathcal{X} = \mathbb{R}^{d_x}$ and $\mathcal{Y} = \mathbb{R}^{d_y}$. Given a closed set $S \subset \mathbb{R}^d$, we denote the set of all projection(s) of $v \in \mathbb{R}^d$ onto $S$, i.e., the nearest point(s) in $S$ from $v$, by $\Pi_S(v) := \arg\min_{w \in S} \|v - w\|$.

ACKNOWLEDGMENTS

This work was supported by Institute of Information & communications Technology Planning & evaluation (IITP) grant (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)) funded by the Korea government (MSIT). The work was also supported by the National Research Foundation of Korea (NRF) grant (No. NRF-2019R1A5A1028324) funded by the Korea government (MSIT). CY acknowledges support from a grant funded by Samsung Electronics Co., Ltd.


Case 3 ($r \ge \kappa_2/c$). Consider $f^{(3)}(x; y) = \frac{\mu_1}{2}x^2 - \frac{L}{2}y^2$. Clearly, $f^{(3)} \in \mathcal{F}(L, \mu_1, L) \subset \mathcal{F}(L, \mu_1, \mu_2)$ and $z^* = (0, 0)$ is its unique saddle point. The GDA on $f^{(3)}$ can be written as

$$x_{k+1} = (1 - \alpha\mu_1)x_k, \qquad y_{k+1} = (1 - \beta L)y_k.$$

To guarantee $\|(x_k; y_k) - (0, 0)\|^2 \le \varepsilon^2$, we need a large enough $k$ to have $x_k^2 \le O(\varepsilon^2)$. Such a $k$ is required to be at least $\Omega\left(\frac{r}{\beta\mu_1}\log(1/\varepsilon)\right)$. Also, we need $\beta < 2/L$ to guarantee $y_k \to 0$ (otherwise, it diverges). Combining these facts, we eventually need $\Omega\left(\frac{Lr}{\mu_1}\log(1/\varepsilon)\right)$ iterations.

Case 4 ($0 < r \le c/\kappa_1$). Consider $f^{(4)}(x; y) = \frac{L}{2}x^2 - \frac{\mu_2}{2}y^2$. Clearly, $f^{(4)} \in \mathcal{F}(L, L, \mu_2) \subset \mathcal{F}(L, \mu_1, \mu_2)$ and $z^* = (0, 0)$ is its unique saddle point. The GDA on $f^{(4)}$ can be written as

$$x_{k+1} = (1 - \alpha L)x_k, \qquad y_{k+1} = (1 - \beta\mu_2)y_k.$$

To guarantee $\|(x_k; y_k) - (0, 0)\|^2 \le \varepsilon^2$, we need a large enough $k$ to have $y_k^2 \le O(\varepsilon^2)$. Such a $k$ is required to be at least $\Omega\left(\frac{1}{\beta\mu_2}\log(1/\varepsilon)\right)$. Also, we need $\beta < 2r/L$ to guarantee $x_k \to 0$ (otherwise, it diverges). Combining these facts, we eventually need $\Omega\left(\frac{L}{r\mu_2}\log(1/\varepsilon)\right)$ iterations.

Lastly, we note that the lower iteration complexity bound in terms of the potential function $V_\lambda$ is equivalent to the complexity in terms of the squared distance from the (unique) saddle point $z^*$, up to constant factors. This is proved in Lemma 29, whose proof we defer.

Here are the postponed/omitted proofs from the proof above.

Proposition 27. For a square matrix $A \in \mathbb{R}^{m \times m}$ and a sequence of $m$-dimensional vectors $(v_k)$, the matrix iteration $v_{k+1} = Av_k$ converges to $0$ for every initial vector if and only if the spectral radius $\rho(A)$ of $A$ (i.e., the maximum absolute value of its eigenvalues) is less than 1. Furthermore, its convergence speed is characterized by $O((\rho(A) + \varepsilon)^k)$ for any (arbitrarily small) $\varepsilon > 0$.

Proof. See Horn & Johnson (2012, Theorem 5.6.10-12).

Proposition 28. Let $\mu_1$, $\mu_2$, and $L$ be positive numbers such that $L \ge \max\{\mu_1, \mu_2\}$. Consider a quadratic function $f$ on $\mathbb{R} \times \mathbb{R}$ defined by

Then, $f \in \mathcal{F}(L, \mu_1, \mu_2)$, and its unique saddle point is $z^* = (0, 0)$.

Proof. The strong-convex-strong-concavity is trivially true.
Note that the gradient and Hessian of $f$ are

Since $H$ is a non-singular matrix, $f$ has a unique stationary point at the origin ($x = 0$, $y = 0$). By Proposition 11, it is also the unique saddle and global minimax point.

For any two distinct points $z_1 = (x_1; y_1)$ and

where $\|H\|_2$ is the spectral norm (i.e., the maximum singular value) of $H$. We would like to show that $\|H\|_2 \le L$. To this end, it is enough to verify the following two inequalities:

the QR-decomposition of a random matrix. Then, we generate the eigenvalues of the $C_i$'s as follows. We sample the entries of $n$ vectors $\lambda$

We add some level of perturbation to some entries of each $\lambda^C_i$; we replace some entries with numbers in the interval $[-L_C, \mu_C]$, keeping the entries of the vector

Because of the perturbation step, some $C_i$'s are not positive definite, and thereby some components $f_i$ become non-(strongly-)concave in $y$.

Next, we sample the matrices $B_i$. There are no requirements for $B$ other than $\|B_i\|_2 \le L_B$; the $B_i$'s are not even necessarily symmetric when $d_x = d_y$. Thus, we first generate the orthogonal matrices $U^B_i$ and $V^B_i$ by taking advantage of the singular value decomposition of random matrices. Then, we generate the singular values of the $B_i$'s by sampling the entries of $n$ vectors $\sigma^B_i$ uniformly from the interval $[0, L_C]$. After that, we define

We typically want to take a larger $L_B$ than $L_C$ to strengthen the interaction term $x^\top B y$.

Recall that the primal function $\Phi$ associated with $f$ is explicitly written as

Note that the inverse of $C$ can be efficiently computed as

Before generating the matrices $A_i$, we first generate $M_i$'s satisfying that

Proof. Apply the eigendecomposition of $M$: $M = Q\Lambda Q^\top$. Let $\tilde{M} = \Lambda^{1/2}Q^\top$. Then, we have $\Phi(x) = \frac{1}{2}\|\tilde{M}x\|^2$, which implies that $\Phi(x) \ge 0$ ($\forall x$) and in fact $\Phi^* = 0$. Note that $\frac{1}{2}\|x\|^2$ is 1-strongly convex. Also, the minimum nonzero singular value of

Therefore, by the proof of Proposition 12, $\Phi(x)$ is a $\mu$-PŁ function of $x$.
Lastly, we note that $\Phi(x)$ is not strongly convex in general, especially when $M$ is a rank-deficient matrix.

Typically, the spectral norm $\|M\|_2$ is known to be bounded above by $\|A\|_2 + L_B^2/\mu_C$ in the worst case (Nouiehed et al., 2019; Li et al., 2022). However, since we sample $M$ without knowing the exact form of the $A_i$'s, while we want to keep the spectral norms $\|A_i\|_2$ from becoming too large (for smoothness of $f_i$), we (empirically) decide to choose a rather smaller $L_M$: simply, we choose

We emphasize that $A$ may have negative eigenvalues; the objective is nonconvex in $x$ in general. We have checked that this is true across the experimental settings. Also, we let $L := \max\{\|A\|_2, L_B, L_C\}$ for further parameter selection. (In fact, because of our choice of parameter values, $L$ was always equal to $L_B$ in our experiments.) Furthermore, we generate the vectors $u_i$ and $v_i$ satisfying

The entries of these vectors are uniformly sampled from the interval $[-\Delta, \Delta]$, so that the average of the entries is centered at zero. In addition, to verify our theory, we choose step sizes of the form $\beta = c_1 \cdot b/(nL)$ and $\alpha = c_0 \cdot \beta/\kappa_2^2$ for some constants $c_0$ and $c_1$ and batch size $b$. Lastly, we specify the values of the parameters described above: $n = 100$, $d = 25$, $\mu_M = \mu_C$, and

The constants $c_0$ and $c_1$ are tuned among $10^{\{-2, -1.5, \pm 1, \pm 0.5, 0\}}$. In the following subsections, we investigate the effects of changing (i) $\Delta \in \{10, 20, 40\}$, determining the discrepancy between components, (ii) the condition number $\kappa_2 \in \{5, 10, 20\}$, determined by $L_B$ and $\mu_C$, and (iii) the batch size $b \in \{1, 25, 50, 100\}$.

with the following symbols from our notation $(K, 1, n, 0, 0, B, B, L, L, \mu_2, L(\kappa_2 + 1), V_\lambda(z_0^1), \alpha, \beta)$, and also put $A = 0$ (their analysis applies only under uniformly bounded component variance per machine). Then we can naively translate the bound of Xie et al. (2021, Theorem 3) to our language as

As far as we can tell, however, there may be a mistake in the proof of Xie et al. (2021, Appendix C.4). From the inequalities on the last page of their paper, we notice that a term involving $\mu$, $\gamma$, $K$, and $T$ (in their notation) might be missing in a step, where $\gamma$ is chosen to be the minimum of several terms including $\frac{1}{87 L_f K}$. Thus, it seems inevitable that this omitted term would lead to an additional term

in the final bound. By combining this with their bound and re-translating it, we eventually have

since their $L_{12}^2/\mu^2$ translates to our $\kappa_2^2$. Therefore, their result actually shows the same dependency on the condition number $\kappa_2$ as our Theorem 1. Nevertheless, comparing the terms related to the component-wise variance $B$, ours is better. The second term in the bound above does not shrink even when the number of iterations (per machine & per communication) grows. In our case (Theorem 1), however, the dominant term (in $K$) can be briefly written as $O\left(\left(\frac{B}{nK^2}\right)^{1/3}\right)$, which diminishes for large $n$, i.e., the number of iterations per epoch.

