TIADA: A TIME-SCALE ADAPTIVE ALGORITHM FOR NONCONVEX MINIMAX OPTIMIZATION

Abstract

Adaptive gradient methods have shown their ability to adjust stepsizes on the fly in a parameter-agnostic manner, and they empirically achieve faster convergence when solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and achieves near-optimal complexities simultaneously in the deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically on a number of machine learning applications.

1. INTRODUCTION

Adaptive gradient methods, such as AdaGrad (Duchi et al., 2011), Adam (Kingma & Ba, 2015) and AMSGrad (Reddi et al., 2018), have become the default choice of optimization algorithms in many machine learning applications owing to their robustness to hyper-parameter selection and fast empirical convergence. These advantages are especially prominent in the nonconvex regime, with success in training deep neural networks (DNNs). Classic analyses of gradient descent for smooth functions require the stepsize to be less than 2/l, where l is the smoothness parameter, which is often unknown for complicated models like DNNs. Many adaptive schemes, usually with diminishing stepsizes based on cumulative gradient information, can adapt to such parameters and thus reduce the burden of hyper-parameter tuning (Ward et al., 2020; Xie et al., 2020). Such tuning-free algorithms are called parameter-agnostic, as they do not require any prior knowledge of problem-specific parameters, e.g., the smoothness or strong-convexity parameter. In this work, we aim to bring the benefits of adaptive stepsizes to solving the following problem:

min_{x ∈ R^{d_1}} max_{y ∈ Y} f(x, y) = E_{ξ∼P}[F(x, y; ξ)],   (1)

where P is an unknown distribution from which we can draw i.i.d. samples, Y ⊂ R^{d_2} is closed and convex, and f : R^{d_1} × R^{d_2} → R is nonconvex in x. We call x the primal variable and y the dual variable. This minimax formulation has found vast applications in modern machine learning, notably generative adversarial networks (Goodfellow et al., 2014; Arjovsky et al., 2017), adversarial learning (Goodfellow et al., 2015; Miller et al., 2020), reinforcement learning (Dai et al., 2017; Modi et al., 2021), sharpness-aware minimization (Foret et al., 2021), domain-adversarial training (Ganin et al., 2016), etc.
Albeit theoretically underexplored, adaptive methods are widely deployed in these applications in combination with popular minimax optimization algorithms such as (stochastic) gradient descent ascent (GDA), extragradient (EG) (Korpelevich, 1976), and optimistic GDA (Popov, 1980; Rakhlin & Sridharan, 2013); see, e.g., (Gulrajani et al., 2017; Daskalakis et al., 2018; Mishchenko et al., 2020; Reisizadeh et al., 2020), just to list a few. While it seems natural to directly extend adaptive stepsizes to minimax optimization algorithms, a recent work by Yang et al. (2022a) pointed out that such schemes may not always converge without knowing problem-dependent parameters.

Figure 1: Comparison of TiAda and AdaGrad on the quadratic function (2) with L = 2 under a poor initial stepsize ratio, i.e., η^x/η^y = 5. Here, η^x_t and η^y_t are the effective stepsizes for x and y, respectively, and κ is the condition number. (a) shows the trajectories of the two algorithms, with the background color indicating the function value f(x, y). In (b), while the effective stepsize ratio stays unchanged for AdaGrad, TiAda adapts to the desired time-scale separation 1/κ, which divides the training process into two stages. In (c), after entering Stage II, TiAda converges fast, whereas AdaGrad diverges.

Unlike the case of minimization, convergence analyses of GDA and EG for nonconvex minimax optimization are subject to time-scale separation (Boţ & Böhm, 2020; Lin et al., 2020a; Sebbouh et al., 2022; Yang et al., 2022b): the stepsize ratio of the primal and dual variables needs to be smaller than a problem-dependent threshold, which was recently shown to be necessary even when the objective is strongly concave in y and true gradients are available (Li et al., 2022). Moreover, Yang et al. (2022a) showed that GDA with standard adaptive stepsizes, which choose the stepsize of each variable based only on the (moving) average of its own past gradients, fails to adapt to the time-scale separation requirement.
Take the following nonconvex-strongly-concave function as a concrete example:

f(x, y) = −(1/2)y² + Lxy − (L²/2)x²,   (2)

where L > 0 is a constant. Yang et al. (2022a) proved that directly using adaptive stepsizes like AdaGrad, Adam and AMSGrad fails to converge if the ratio of the initial stepsizes of x and y (denoted η^x and η^y) is large. We illustrate this phenomenon in Figures 1(a) and 1(c), where AdaGrad diverges. To sum up, adaptive stepsizes designed for minimization are not time-scale adaptive for minimax optimization and thus not parameter-agnostic. To circumvent this time-scale separation bottleneck, Yang et al. (2022a) introduced an adaptive algorithm, NeAda, for problem (1) with nonconvex-strongly-concave objectives. NeAda is a two-loop algorithm built upon GDmax (Lin et al., 2020a) that, after one primal variable update, updates the dual variable for multiple steps until a stopping criterion is satisfied in the inner loop. Although the algorithm is agnostic to the smoothness and strong-concavity parameters, there are several limitations that may undermine its performance in large-scale training: (a) in the stochastic setting, it gradually increases the number of inner-loop steps (k steps for the k-th outer loop) to improve the accuracy of the inner maximization problem, resulting in possibly wasted inner-loop updates if the maximization problem is already well solved; (b) NeAda needs a large batchsize of order Ω(ϵ^{-2}) to achieve the near-optimal convergence rate in theory; (c) it is not fully adaptive to the gradient noise, since it deploys different strategies in the deterministic and stochastic settings. In this work, we address all of the issues above by proposing TiAda (Time-scale Adaptive Algorithm), a single-loop algorithm with time-scale adaptivity for minimax optimization.
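To make the time-scale requirement concrete, the following minimal sketch (our own illustration, not code from the paper) runs plain non-adaptive GDA on the quadratic example with L = 2. For this instance the convergence threshold for the stepsize ratio is 1/κ = 1/L² = 1/4: a ratio below it contracts toward the stationary manifold y = Lx, while a ratio above it diverges.

```python
import numpy as np

L = 2.0  # parameter of the quadratic test function f(x, y) = -y^2/2 + L*x*y - (L^2/2)*x^2

def grads(x, y):
    gx = L * y - L**2 * x   # gradient for the descent player x
    gy = -y + L * x         # gradient for the ascent player y
    return gx, gy

def gda(eta_x, eta_y, steps=200):
    """Simultaneous non-adaptive GDA; returns the final gradient norm."""
    x, y = 1.0, 0.01
    for _ in range(steps):
        gx, gy = grads(x, y)
        x, y = x - eta_x * gx, y + eta_y * gy
    gx, gy = grads(x, y)
    return float(np.hypot(gx, gy))

# ratio 0.1 < 1/kappa = 0.25: converges; ratio 0.5 > 0.25: diverges
good = gda(eta_x=0.01, eta_y=0.1)
bad = gda(eta_x=0.05, eta_y=0.1)
```

For this function the residual r = y − Lx evolves as r ← r(1 − η^y + L²η^x), so the iteration contracts exactly when the stepsize ratio is below 1/L², matching the necessity result of Li et al. (2022).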
Specifically, one of our major modifications is setting the effective stepsize, i.e., the scale of the (stochastic) gradient used in the updates, of the primal variable to the reciprocal of (a power of) the maximum between the primal and dual variables' second moments, i.e., the sums of their past squared gradient norms. This ensures that the effective stepsize ratio of x and y is upper bounded by a decreasing sequence, which eventually reaches the desired time-scale separation. Taking the test function (2) as an example, Figure 1 illustrates the time-scale adaptivity of TiAda: in Stage I, the stepsize ratio quickly decreases below the threshold; in Stage II, the ratio is stabilized and the gradient norm starts to converge fast. We focus on minimax problems (1) that are strongly concave in y, since other nonconvex regimes are far less understood even without adaptive stepsizes. Moreover, a near-stationary point may not exist in nonconvex-nonconcave (NC-NC) problems, and finding a first-order local minimax point is already PPAD-complete (Daskalakis et al., 2021). We consider a constraint for the dual variable, which is common in convex optimization with adaptive stepsizes (Levy, 2017; Levy et al., 2018) and in minimax optimization with non-adaptive stepsizes (Lin et al., 2020a). In summary, our contributions are as follows: • We introduce the first single-loop and fully parameter-agnostic adaptive algorithm, TiAda, for nonconvex-strongly-concave (NC-SC) minimax optimization. It adapts to the necessary time-scale separation without large batchsizes or any knowledge of problem-dependent parameters or target accuracy. TiAda finds an ϵ-stationary point with an optimal complexity of O(ϵ^{-2}) in the deterministic case, and a near-optimal sample complexity of O(ϵ^{-(4+δ)}) for any small δ > 0 in the stochastic case. It shaves off the extra logarithmic terms in the complexity of NeAda with the AdaGrad stepsize for both primal and dual variables (Yang et al., 2022a).
TiAda is proven to be noise-adaptive, the first of its kind among nonconvex minimax optimization algorithms. • While TiAda is based on the AdaGrad stepsize, we generalize it to other existing adaptive schemes and conduct experiments on several tasks. The tasks include 1) the test functions of Yang et al. (2022a) that exhibit the non-convergence of GDA with adaptive schemes under poor initial stepsize ratios, 2) distributional robustness optimization (Sinha et al., 2018) on the MNIST dataset with an NC-SC objective, and 3) training NC-NC generative adversarial networks on the CIFAR-10 dataset. In all tasks, we show that TiAda converges faster and is more robust compared with NeAda and GDA with other existing adaptive stepsizes.

1.1. RELATED WORK

Adaptive gradient methods. AdaGrad brings an adaptive mechanism to gradient-based optimization algorithms that adjusts the stepsize by accumulating past gradients. The original AdaGrad was introduced for online convex optimization and maintains coordinate-wise stepsizes. In nonconvex stochastic optimization, AdaGrad-Norm, with one learning rate for all directions, is shown to achieve the same complexity as SGD (Ward et al., 2020; Li & Orabona, 2019), even with high-probability bounds (Kavis et al., 2022; Li & Orabona, 2020). In comparison, RMSProp (Hinton et al., 2012) and Adam (Kingma & Ba, 2015) use a decaying moving average of past gradients, but may suffer from divergence (Reddi et al., 2018). Many variants of Adam have been proposed, and a wide family of them, including AMSGrad, come with convergence guarantees (Zhou et al., 2018; Chen et al., 2018; Défossez et al., 2020; Zhang et al., 2022b). One of the distinguishing traits of adaptive algorithms is that they can achieve order-optimal rates without knowledge of problem parameters, such as the smoothness or the variance of the noise, even in nonconvex optimization (Ward et al., 2020; Levy et al., 2021; Kavis et al., 2019).

Adaptive minimax optimization algorithms. Adaptive stepsize schemes have naturally been extended to minimax optimization, both in theory and in practice, notably in the training of GANs (Goodfellow, 2016; Gidel et al., 2018). In the convex-concave regime, several adaptive algorithms are designed based on EG and the AdaGrad stepsize, and they inherit the parameter-agnostic characteristic (Bach & Levy, 2019; Antonakopoulos et al., 2019). In sharp contrast, when the objective function is nonconvex in one variable, most existing adaptive algorithms require knowledge of problem parameters (Huang & Huang, 2021; Huang et al., 2021; Guo et al., 2021).
Very recently, it was proved that a parameter-dependent ratio between the two stepsizes is necessary for GDA in NC-SC minimax problems, both with non-adaptive stepsizes (Li et al., 2022) and with most existing adaptive stepsizes (Yang et al., 2022a). Heusel et al. (2017) show that two-time-scale GDA with non-adaptive stepsizes or Adam converges, but under the assumption that an asymptotically stable attractor exists.

Other NC-SC minimax optimization algorithms. In the NC-SC setting, the most popular algorithms are GDA and GDmax, in which one primal variable update is followed by one or multiple dual variable updates. Both can achieve O(ϵ^{-2}) complexity in the deterministic setting and O(ϵ^{-4}) sample complexity in the stochastic setting (Lin et al., 2020a; Chen et al., 2021; Nouiehed et al., 2019; Yang et al., 2020), which are not improvable in the dependency on ϵ given existing lower complexity bounds (Zhang et al., 2021; Li et al., 2021). Later, several works further improved the dependency on the condition number with more complicated algorithms in the deterministic (Yang et al., 2022b; Lin et al., 2020b) and stochastic settings (Zhang et al., 2022a). None of the algorithms above uses adaptive stepsizes, and all rely on knowledge of problem parameters.

1.2. NOTATIONS

We denote by l the smoothness parameter and by µ the strong-concavity parameter, whose formal definitions are given in Assumptions 3.1 and 3.2, and let κ := l/µ be the condition number. We assume access to a stochastic gradient oracle returning [∇_x F(x, y; ξ), ∇_y F(x, y; ξ)]. For the minimax problem (1), we denote by y*(x) := arg max_{y∈Y} f(x, y) the solution of the inner maximization problem, by Φ(x) := f(x, y*(x)) the primal function, and by P_Y(·) the projection operator onto the set Y. For notational simplicity, we will use the name of an existing adaptive algorithm to refer to its simple combination with GDA, i.e., GDA with that adaptive scheme applied separately to both x and y. For instance, "AdaGrad" for minimax problems stands for the algorithm that uses AdaGrad stepsizes separately for x and y in GDA.

2. METHOD

We formally introduce the TiAda method in Algorithm 1; the major difference from AdaGrad lies in line 5. Like AdaGrad, TiAda stores the accumulated squared (stochastic) gradient norms of the primal and dual variables in v^x_t and v^y_t, respectively. We refer to the hyper-parameters η^x and η^y as the initial stepsizes, and to the actual stepsizes used for updating in line 5 as effective stepsizes, denoted by η^x_t and η^y_t. TiAda adopts effective stepsizes η^x_t = η^x / max{v^x_{t+1}, v^y_{t+1}}^α and η^y_t = η^y / (v^y_{t+1})^β, while AdaGrad uses η^x / (v^x_{t+1})^{1/2} and η^y / (v^y_{t+1})^{1/2}. In Section 3, our theoretical analysis suggests choosing α > 1/2 > β. We will also illustrate in the next subsection that the max structure and the different α, β make our algorithm adapt to the desired time-scale separation. For simplicity of analysis, similar to AdaGrad-Norm (Ward et al., 2020), we use the norms of gradients for updating the effective stepsizes. A more practical coordinate-wise variant that can be used for high-dimensional models is presented in Section 4.1.

Algorithm 1 TiAda (Time-scale Adaptive Algorithm)
1: Input: (x_0, y_0), v^x_0 > 0, v^y_0 > 0, η^x > 0, η^y > 0, α > 0, β > 0 and α > β.
2: for t = 0, 1, 2, ... do
3:   g^x_t = ∇_x F(x_t, y_t; ξ^x_t) and g^y_t = ∇_y F(x_t, y_t; ξ^y_t)
4:   v^x_{t+1} = v^x_t + ∥g^x_t∥² and v^y_{t+1} = v^y_t + ∥g^y_t∥²
5:   x_{t+1} = x_t − η^x / max{v^x_{t+1}, v^y_{t+1}}^α · g^x_t and y_{t+1} = P_Y(y_t + η^y / (v^y_{t+1})^β · g^y_t)
6: end for
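As a concrete sketch, Algorithm 1 can be rendered in a few lines of NumPy (our own minimal illustration, with Y = R^{d_2} so the projection is the identity; the gradient oracles and initial point are placeholders):

```python
import numpy as np

def tiada(grad_x, grad_y, x0, y0, eta_x=1.0, eta_y=1.0,
          alpha=0.6, beta=0.4, v0=1.0, steps=2000):
    """Minimal TiAda (Algorithm 1) with unconstrained y (projection = identity)."""
    x, y = np.asarray(x0, float), np.asarray(y0, float)
    vx, vy = v0, v0
    for _ in range(steps):
        g_x, g_y = grad_x(x, y), grad_y(x, y)
        vx += np.sum(g_x**2)                       # line 4: accumulate squared norms
        vy += np.sum(g_y**2)
        x = x - eta_x / max(vx, vy)**alpha * g_x   # line 5: max-coupled stepsize for x
        y = y + eta_y / vy**beta * g_y             # ascent step for y
    return x, y

# the quadratic test function (2) with L = 2: f(x, y) = -y^2/2 + 2xy - 2x^2
L = 2.0
gx = lambda x, y: L * y - L**2 * x
gy = lambda x, y: -y + L * x
x, y = tiada(gx, gy, x0=1.0, y0=0.01)
```

Run on the quadratic example (2), both gradient norms are driven toward zero without any tuning of η^x and η^y, in line with the time-scale adaptivity discussed next.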

2.1. THE TIME-SCALE ADAPTIVITY OF TIADA

Current analyses of GDA with non-adaptive stepsizes require the time-scale ratio, η^x_t/η^y_t, to be smaller than a threshold depending on problem constants such as the smoothness and strong-concavity parameters (Lin et al., 2020a; Yang et al., 2022b). The intuition is that we should not aggressively update x if the inner maximization problem has not yet been solved accurately, i.e., before we have found a good approximation of y*(x). Therefore, the effective stepsize of x should be small compared with that of y. It is tempting to expect adaptive stepsizes to automatically find a suitable time-scale separation. However, the quadratic example (2) given by Yang et al. (2022a) shattered this illusion. In this example, the effective stepsize ratio stays the same along the run of existing adaptive algorithms, including AdaGrad (see Figure 1(b)), Adam and AMSGrad, and they fail to converge if the initial stepsizes are not carefully chosen (see Yang et al. (2022a) for details). As v^x_t and v^y_t only separately contain the gradients of x and y, the effective stepsizes of the two variables in these adaptive methods depend only on their own history, which prevents them from cooperating to adjust the ratio. We now explain how TiAda adapts to both the required time-scale separation and small enough stepsizes. First, the ratio of our modified effective stepsizes is upper bounded by a decreasing sequence when α > β:

η^x_t / η^y_t = (η^x / max{v^x_{t+1}, v^y_{t+1}}^α) / (η^y / (v^y_{t+1})^β) ≤ (η^x / (v^y_{t+1})^α) / (η^y / (v^y_{t+1})^β) = (η^x / η^y) · 1 / (v^y_{t+1})^{α−β},   (3)

since v^y_t is the sum of past gradient norms and is increasing. Regardless of the initial stepsize ratio η^x/η^y, we can therefore expect the effective stepsize ratio to eventually drop below the threshold required for convergence. On the other hand, the effective stepsizes of the primal and dual variables are themselves upper bounded by the decreasing sequences η^x / (v^x_{t+1})^α and η^y / (v^y_{t+1})^β, respectively.
Similar to AdaGrad, such adaptive stepsizes will become small enough, e.g., O(1/l), to ensure convergence. Another way to look at the effective stepsize of x is

η^x_t = η^x / max{v^x_{t+1}, v^y_{t+1}}^α = (v^x_{t+1})^α / max{v^x_{t+1}, v^y_{t+1}}^α · η^x / (v^x_{t+1})^α.   (4)

If the gradients of y are small (i.e., v^y_{t+1} < v^x_{t+1}), meaning the inner maximization problem is well solved, then the first factor equals 1 and the effective stepsize of x is just the second factor, similar to the AdaGrad update. If v^y_{t+1} dominates v^x_{t+1}, the first factor is smaller than 1, slowing down the update of x while waiting for a better approximation of y*(x). To demonstrate the time-scale adaptivity of TiAda, we conducted experiments on the quadratic minimax example (2) with L = 2. As shown in Figure 1(b), while the effective stepsize ratio of AdaGrad stays unchanged for this particular function, TiAda progressively decreases the ratio. According to Lemma 2.1 of Yang et al. (2022a), 1/κ is the threshold at which GDA starts to converge. We label the time period before reaching this threshold as Stage I, during which, as shown in Figure 1(c), the gradient norm of TiAda increases. However, as soon as it enters Stage II, i.e., when the ratio drops below 1/κ, TiAda converges fast to a stationary point. In contrast, since the stepsize ratio of AdaGrad never reaches this threshold, its gradient norm keeps growing.
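The two-stage behavior can be reproduced in a few lines (our own illustrative script, not the paper's experiment code): with a poor initial ratio η^x/η^y = 5 on the quadratic (2), TiAda's effective stepsize ratio falls below the 1/κ = 1/4 threshold for this instance, while the AdaGrad-style ratio stays above it and the AdaGrad iterates drift away.

```python
import numpy as np

L, ETA_X, ETA_Y, V0 = 2.0, 5.0, 1.0, 1.0   # poor initial ratio eta_x / eta_y = 5

def run(mode, steps=2000, alpha=0.6, beta=0.4):
    x, y, vx, vy = 1.0, 0.01, V0, V0
    ratios, gnorms = [], []
    for _ in range(steps):
        g_x, g_y = L * y - L**2 * x, -y + L * x
        vx, vy = vx + g_x**2, vy + g_y**2
        if mode == "tiada":
            ex, ey = ETA_X / max(vx, vy)**alpha, ETA_Y / vy**beta
        else:  # AdaGrad-style GDA: each player only sees its own history
            ex, ey = ETA_X / vx**0.5, ETA_Y / vy**0.5
        ratios.append(ex / ey)                      # effective stepsize ratio
        gnorms.append(float(np.hypot(g_x, g_y)))
        x, y = x - ex * g_x, y + ey * g_y
    return ratios, gnorms

ti_ratio, ti_gnorm = run("tiada")
ada_ratio, ada_gnorm = run("adagrad")
```

On this function the AdaGrad ratio is pinned near a constant because both second moments grow at proportional rates, whereas TiAda's max-coupled denominator forces the ratio down as v^y_{t+1} accumulates.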

3. THEORETICAL ANALYSIS OF TIADA

In this section, we study the convergence of TiAda in the NC-SC setting with both deterministic and stochastic gradient oracles. We make the following assumptions to develop our convergence results.

Assumption 3.1 (smoothness). Function f(x, y) is l-smooth (l > 0) in both x and y, that is, for any x_1, x_2 ∈ R^{d_1} and y_1, y_2 ∈ Y, we have

max{∥∇_x f(x_1, y_1) − ∇_x f(x_2, y_2)∥, ∥∇_y f(x_1, y_1) − ∇_y f(x_2, y_2)∥} ≤ l(∥x_1 − x_2∥ + ∥y_1 − y_2∥).

Assumption 3.2 (strong concavity in y). Function f(x, y) is µ-strongly-concave (µ > 0) in y, that is, for any x ∈ R^{d_1} and y_1, y_2 ∈ Y, we have

f(x, y_1) ≥ f(x, y_2) + ⟨∇_y f(x, y_1), y_1 − y_2⟩ + (µ/2)∥y_1 − y_2∥².

Assumption 3.3 (interior optimal point). For any x ∈ R^{d_1}, y*(x) lies in the interior of Y.

Remark 3.1. The last assumption ensures ∇_y f(x, y*(x)) = 0, which is important for AdaGrad-like stepsizes that use the sum of squared norms of past gradients in the denominator. If the gradient with respect to y were nonzero at y*(x), the stepsize would keep decreasing even near the optimal point, leading to slow convergence. This assumption could potentially be relaxed by using generalized AdaGrad stepsizes (Bach & Levy, 2019).

We aim to find a near-stationary point of the minimax problem (1). Here, (x, y) is defined to be an ϵ-stationary point if ∥∇_x f(x, y)∥ ≤ ϵ and ∥∇_y f(x, y)∥ ≤ ϵ in the deterministic setting, or E∥∇_x f(x, y)∥² ≤ ϵ² and E∥∇_y f(x, y)∥² ≤ ϵ² in the stochastic setting, where the expectation is taken over all the randomness in the algorithm. This stationarity notion can easily be translated to near-stationarity of the primal function Φ(x) = max_{y∈Y} f(x, y) (Yang et al., 2022b). Under our analyses, TiAda achieves the optimal O(ϵ^{-2}) complexity in the deterministic setting and a near-optimal O(ϵ^{-(4+δ)}) sample complexity for any small δ > 0 in the stochastic setting.
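As a quick sanity check we add here (not part of the paper), one can verify numerically that the quadratic example (2) satisfies Assumptions 3.1 and 3.2 with l = L² and µ = 1; for quadratics the strong-concavity inequality holds with equality.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 2.0
f  = lambda x, y: -0.5 * y**2 + L * x * y - 0.5 * L**2 * x**2
fx = lambda x, y: L * y - L**2 * x      # ∇_x f
fy = lambda x, y: -y + L * x            # ∇_y f

l_smooth, mu = L**2, 1.0                # claimed constants for this instance
smooth_gap, concave_gap = -np.inf, np.inf
for _ in range(1000):
    x1, y1, x2, y2 = rng.uniform(-5, 5, size=4)
    # Assumption 3.1: block-wise Lipschitz gradients w.r.t. the sum norm
    bound = l_smooth * (abs(x1 - x2) + abs(y1 - y2))
    lip = max(abs(fx(x1, y1) - fx(x2, y2)), abs(fy(x1, y1) - fy(x2, y2)))
    smooth_gap = max(smooth_gap, lip - bound)            # <= 0 iff 3.1 holds
    # Assumption 3.2: mu-strong concavity in y
    lower = f(x1, y2) + fy(x1, y1) * (y1 - y2) + 0.5 * mu * (y1 - y2)**2
    concave_gap = min(concave_gap, f(x1, y1) - lower)    # >= 0 iff 3.2 holds
```

Assumption 3.3 also holds here since Y = R and y*(x) = Lx is always interior.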

3.1. DETERMINISTIC SETTING

In this subsection, we assume access to exact gradients of f(·, ·); we can therefore replace ∇_x F(x_t, y_t; ξ^x_t) and ∇_y F(x_t, y_t; ξ^y_t) by ∇_x f(x_t, y_t) and ∇_y f(x_t, y_t) in Algorithm 1.

Theorem 3.1 (deterministic setting). Under Assumptions 3.1 to 3.3, Algorithm 1 with deterministic gradient oracles satisfies that for any 0 < β < α < 1, after T iterations,

(1/T) Σ_{t=0}^{T−1} ∥∇_x f(x_t, y_t)∥² + (1/T) Σ_{t=0}^{T−1} ∥∇_y f(x_t, y_t)∥² ≤ O(1/T).

This theorem implies that for any initial stepsizes, TiAda finds an ϵ-stationary point within O(ϵ^{-2}) iterations. Such complexity is comparable to that of non-adaptive methods, such as vanilla GDA (Lin et al., 2020a), and is optimal in the dependency on ϵ (Zhang et al., 2021). Like NeAda (Yang et al., 2022a), TiAda does not need any prior knowledge of µ and l, but it improves over NeAda by removing the logarithmic term in the complexity. Notably, we provide a unified analysis for a wide range of α and β, while most existing literature on AdaGrad-like stepsizes only validates a specific hyper-parameter choice, e.g., α = 1/2 in minimization problems (Ward et al., 2020; Kavis et al., 2019).

3.2. STOCHASTIC SETTING

In this subsection, we assume access to a stochastic gradient oracle that returns unbiased noisy gradients ∇_x F(x, y; ξ) and ∇_y F(x, y; ξ). We also make the following additional assumptions.

Assumption 3.4 (stochastic gradients). For z ∈ {x, y}, we have E_ξ[∇_z F(x, y; ξ)] = ∇_z f(x, y). In addition, there exists a constant G such that ∥∇_z F(x, y; ξ)∥ ≤ G for any x ∈ R^{d_1} and y ∈ Y.

Assumption 3.5 (bounded primal function value). There exists a constant Φ_max ∈ R such that Φ(x) ≤ Φ_max for any x ∈ R^{d_1}.

Remark 3.2. Bounded gradients and function values are assumed in many works on adaptive algorithms (Kavis et al., 2022; Levy et al., 2021). The former implies that the domain of y is bounded, which is also assumed in the analyses of AdaGrad (Levy, 2017; Levy et al., 2018). In neural networks with rectified activations, because of the scale-invariance property (Dinh et al., 2017), imposing boundedness of y does not affect expressiveness. Wasserstein GANs (Arjovsky et al., 2017) also use projections on the critic to restrain the weights to a small cube around the origin.

Assumption 3.6 (second-order Lipschitz continuity for y). For any x_1, x_2 ∈ R^{d_1} and y_1, y_2 ∈ Y, there exists a constant L such that ∥∇²_{xy} f(x_1, y_1) − ∇²_{xy} f(x_2, y_2)∥ ≤ L(∥x_1 − x_2∥ + ∥y_1 − y_2∥) and ∥∇²_{yy} f(x_1, y_1) − ∇²_{yy} f(x_2, y_2)∥ ≤ L(∥x_1 − x_2∥ + ∥y_1 − y_2∥).

Remark 3.3. Chen et al. (2021) also impose this assumption to achieve the optimal O(ϵ^{-4}) complexity for GDA with non-adaptive stepsizes for solving NC-SC minimax problems. Together with Assumption 3.3, it implies that y*(·) is smooth. Without this assumption, Lin et al. (2020a) only show a worse complexity of O(ϵ^{-5}) for GDA without large batchsizes.

Theorem 3.2 (stochastic setting). Under Assumptions 3.1 to 3.6, Algorithm 1 with stochastic gradient oracles satisfies that for any 0 < β < α < 1, after T iterations,

(1/T) E[Σ_{t=0}^{T−1} ∥∇_x f(x_t, y_t)∥² + Σ_{t=0}^{T−1} ∥∇_y f(x_t, y_t)∥²] ≤ O(T^{α−1} + T^{−α} + T^{β−1} + T^{−β}).

TiAda can achieve a complexity arbitrarily close to the optimal sample complexity O(ϵ^{-4}) (Li et al., 2021) by choosing α and β arbitrarily close to 0.5. Specifically, TiAda achieves a complexity of O(ϵ^{-(4+δ)}) for any small δ > 0 if we set α = 0.5 + δ/(8 + 2δ) and β = 0.5 − δ/(8 + 2δ). Notably, this matches the complexity of NeAda with AdaGrad stepsizes for both variables (Yang et al., 2022a); NeAda may attain O(ϵ^{-4}) complexity with more complicated subroutines for y. Theorem 3.2 implies that TiAda is fully agnostic to problem parameters, e.g., µ, l and σ. GDA with non-adaptive stepsizes (Lin et al., 2020a) and vanilla single-loop adaptive methods (Huang & Huang, 2021), such as AdaGrad and AMSGrad, all require knowledge of these parameters. Compared with the only other parameter-agnostic algorithm, NeAda, our algorithm has several advantages. First, TiAda is a single-loop algorithm, while NeAda (Yang et al., 2022a) needs increasing inner-loop steps and a huge batchsize of order Ω(ϵ^{-2}) to achieve its best complexity. Second, our stationarity guarantee is for E∥∇_x f(x, y)∥² ≤ ϵ², which is stronger than the E∥∇_x f(x, y)∥ ≤ ϵ guarantee of NeAda. Last but not least, although NeAda does not need to know the exact value of the variance σ in the stochastic setting when σ > 0, it uses a different stopping criterion for the inner loop in the deterministic setting when σ = 0, so it still needs partial information about σ. In comparison, TiAda achieves the (near-)optimal complexity in both settings with the same strategy. Consistent with the intuition of time-scale adaptivity in Section 2.1, the convergence result can be derived in two stages. In Stage I, according to the upper bound on the ratio in Equation (3), we expect the term 1/(v^y_{t+1})^{α−β} to reduce to a constant c, yielding a desirable time-scale separation. This means that v^y_{t+1} has to grow to nearly (1/c)^{1/(α−β)}.
In Stage II, once the time-scale separation is satisfied, TiAda converges at the speed specified in Theorem 3.2. This indicates that the proximity between α and β controls a trade-off between the two stages: when α and β are close, the overall convergence rate is closer to optimal, but the transition phase in Stage I is longer, albeit only by a constant factor. We also present an empirical ablation study on the convergence behavior with different choices of α and β in Appendix A.2. Remark 3.4. In TiAda, the update of x requires knowledge of the gradients of y (through v^y_{t+1}). However, in some privacy-sensitive applications, one player may not have access to information about the other player (Koller & Pfeffer, 1995; Foster & Young, 2006; He et al., 2016). Therefore, we also consider a variant of TiAda that does not take the maximum of the gradient norms, i.e., that sets the effective stepsize of x in Algorithm 1 to η^x / (v^x_{t+1})^α. This variant achieves a sub-optimal complexity of O(ϵ^{-6}). This result further justifies the importance of coordination between the adaptive stepsizes of the two players in achieving faster convergence for minimax optimization. The algorithm and convergence results are presented in Appendix C.4.

4. EXPERIMENTS

In this section, we first present extensions of TiAda that accommodate adaptive schemes other than AdaGrad and are more practical for deep models. We then present empirical results for TiAda and compare it with (i) simple combinations of GDA and adaptive stepsizes, which are commonly used in practice, and (ii) NeAda with different adaptive mechanisms (Yang et al., 2022a). Our experiments include the test functions proposed by Yang et al. (2022a), NC-SC distributional robustness optimization (Sinha et al., 2018), and training the NC-NC Wasserstein GAN with gradient penalty (Gulrajani et al., 2017). We believe this not only validates our theoretical results but also shows the potential of our algorithm in real-world scenarios. To show the strength of TiAda being parameter-agnostic, in all experiments we simply select α = 0.6 and β = 0.4 without further tuning these two hyper-parameters. All experimental details, including the neural network structures and hyper-parameters, are described in Appendix A.1.

4.1. EXTENSIONS TO OTHER ADAPTIVE STEPSIZES AND HIGH-DIMENSIONAL MODELS

Although we design TiAda upon AdaGrad-Norm, it is easy and intuitive to apply other adaptive schemes like Adam and AMSGrad. To do so, for z ∈ {x, y}, we replace the definitions of g^z_t and v^z_{t+1} in lines 3 and 4 of Algorithm 1 with

g^z_t = β^z_t g^z_{t−1} + (1 − β^z_t) ∇_z F(x_t, y_t; ξ^z_t),   v^z_{t+1} = ψ(v_0, {∥∇_z F(x_i, y_i; ξ^z_i)∥²}_{i=0}^t),

where {β^z_t} are the momentum parameters and ψ is the second-moment function. Some common stepsizes that fit in this generalized framework are listed in Table 1 in the appendix. Since Adam is widely used in many deep learning tasks, we also implement generalized TiAda with Adam stepsizes in our experiments for real-world applications, and we label it "TiAda-Adam". Besides generalizing TiAda to accommodate different stepsize schemes, we also provide a coordinate-wise version of TiAda for high-dimensional models. Note that we cannot simply make everything in Algorithm 1 coordinate-wise, because we use the gradients of y in the stepsize of x, and there is no correspondence between the coordinates of x and y. Therefore, in light of the intuition in Equation (4), we use the globally accumulated gradient norms to dynamically adjust the stepsize of x. Denote the second moment (analogous to v^x_{t+1} in Algorithm 1) for the i-th coordinate of x at the t-th step by v^x_{t+1,i}, and globally v^x_{t+1} := Σ_{i=1}^{d_1} v^x_{t+1,i}; we use similar notations for y. Then the update for the i-th coordinates, i.e., x^i and y^i, can be written as

x^i_{t+1} = x^i_t − (v^x_{t+1})^α / max{v^x_{t+1}, v^y_{t+1}}^α · η^x / (v^x_{t+1,i})^α · ∇_{x^i} f(x_t, y_t),
y^i_{t+1} = y^i_t + η^y / (v^y_{t+1,i})^β · ∇_{y^i} f(x_t, y_t).

Our results in the following subsections provide strong empirical evidence for the effectiveness of these TiAda variants, and developing convergence guarantees for them is an interesting direction for future work. We believe our proof techniques for TiAda, together with existing convergence results for coordinate-wise AdaGrad and AMSGrad (Zhou et al., 2018; Chen et al., 2018; Défossez et al., 2020), can shed light on the theoretical analyses of these variants.
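A sketch of this coordinate-wise variant (our own NumPy rendering under simplifying assumptions: unconstrained y and deterministic gradients) reads:

```python
import numpy as np

def tiada_coord(grad_x, grad_y, x, y, eta_x=1.0, eta_y=1.0,
                alpha=0.6, beta=0.4, v0=1.0, steps=2000):
    """Coordinate-wise TiAda: per-coordinate second moments plus a global
    max-factor that throttles x when y's accumulated gradients dominate."""
    x, y = np.array(x, float), np.array(y, float)
    vx = np.full_like(x, v0)              # v^x_{t,i}
    vy = np.full_like(y, v0)              # v^y_{t,i}
    for _ in range(steps):
        g_x, g_y = grad_x(x, y), grad_y(x, y)
        vx += g_x**2
        vy += g_y**2
        vx_tot, vy_tot = vx.sum(), vy.sum()                   # global second moments
        scale = vx_tot**alpha / max(vx_tot, vy_tot)**alpha    # global factor in (0, 1]
        x = x - scale * eta_x / vx**alpha * g_x
        y = y + eta_y / vy**beta * g_y
    return x, y

# two independent copies of the quadratic (2) with L = 2, stacked coordinate-wise
L = 2.0
gx = lambda x, y: L * y - L**2 * x
gy = lambda x, y: -y + L * x
x, y = tiada_coord(gx, gy, x=[1.0, -1.0], y=[0.01, 0.0])
```

The global factor `scale` plays the role of the first fraction in the update above, while each coordinate keeps its own AdaGrad-style denominator.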

4.2. TEST FUNCTIONS

First, we examine TiAda on the quadratic function (2), which was used to show the non-convergence of simple combinations of GDA and adaptive stepsizes (Yang et al., 2022a). Since TiAda is based on AdaGrad, we compare it to GDA with the AdaGrad stepsize and to NeAda-AdaGrad (Yang et al., 2022a). The results are shown in the first row of Figure 2. When the initial ratio is poor, TiAda and NeAda-AdaGrad always converge while AdaGrad diverges. NeAda also suffers from slow convergence under poor initial ratios, e.g., 1 and 1/2, after 2000 iterations. In contrast, TiAda automatically balances the stepsizes and converges fast under all ratios. For the stochastic case, we follow Yang et al. (2022a) and conduct experiments on a more complicated objective built from the two-dimensional McCormick function:

f(x, y) = sin(x_1 + x_2) + (x_1 − x_2)² − (3/2)x_1 + (5/2)x_2 + 1 + x_1 y_1 + x_2 y_2 − (1/2)(y_1² + y_2²).

TiAda consistently outperforms AdaGrad and NeAda-AdaGrad, as demonstrated in the second row of Figure 2, regardless of the initial ratio. On this function, we also run an ablation study on the effect of the max-operator in the update of x. We compare TiAda with its variant without the max-operator, TiAda without MAX (Algorithm 2 in the appendix), whose effective stepsize for x is η^x / (v^x_{t+1})^α. According to Figure 2(h), TiAda converges to smaller gradient norms under all configurations of α and β.
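A toy reproduction of the stochastic setup can be sketched as follows (our own illustration, not the paper's code: we add Gaussian noise to the gradients of the McCormick-based objective and, for simplicity, take Y = R² instead of projecting onto a bounded set). It also tracks the decreasing bound on the effective stepsize ratio from Equation (3).

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_x(x, y):
    c = np.cos(x[0] + x[1])
    return np.array([c + 2 * (x[0] - x[1]) - 1.5 + y[0],
                     c - 2 * (x[0] - x[1]) + 2.5 + y[1]])

def grad_y(x, y):
    return x - y   # gradient of x.y - ||y||^2 / 2 in y

x, y = np.zeros(2), np.zeros(2)
vx, vy = 1.0, 1.0
alpha, beta, eta = 0.6, 0.4, 0.1
ratio_bound = []   # (eta_x/eta_y) / vy^(alpha-beta), the bound from Eq. (3)
for _ in range(3000):
    g_x = grad_x(x, y) + 0.1 * rng.standard_normal(2)   # noisy oracle
    g_y = grad_y(x, y) + 0.1 * rng.standard_normal(2)
    vx, vy = vx + g_x @ g_x, vy + g_y @ g_y
    ratio_bound.append(1.0 / vy**(alpha - beta))
    x = x - eta / max(vx, vy)**alpha * g_x
    y = y + eta / vy**beta * g_y
```

The bound shrinks monotonically as v^y accumulates, which is the mechanism behind the Stage I/Stage II split described in Section 2.1.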

4.3. DISTRIBUTIONAL ROBUSTNESS OPTIMIZATION

In this subsection, we consider distributional robustness optimization (Sinha et al., 2018). We aim to train the model weights, the primal variable x, to be robust to perturbations of the image inputs, the dual variable y. The problem can be formulated as

min_x max_{y=[y_1,...,y_n]} (1/n) Σ_{i=1}^n f_i(x, y_i) − γ∥y_i − v_i∥²,

where f_i is the loss function of the i-th sample, v_i is the i-th input image, and y_i is the corresponding perturbation. There are n samples in total, and γ is a trade-off hyper-parameter between the original loss and the penalty on the perturbations. If γ is large enough, the problem is NC-SC. We conduct the experiments on the MNIST dataset (LeCun, 1998). In the left two plots of Figure 3, we compare TiAda with AdaGrad and NeAda-AdaGrad in terms of convergence. Since it is common in practice to update y 15 times after each x update (Sinha et al., 2018) for better generalization, we implement AdaGrad with both a single and 15 inner-loop iterations (updates of y). To show that TiAda is more robust to the initial stepsize ratio, we compare two sets of initial stepsize configurations with two different ratios; extensive experiments on larger ranges of stepsizes, in which TiAda performs best among all stepsize combinations in our grid, are deferred to Appendix A.3. In both cases, TiAda outperforms NeAda and AdaGrad; especially when η^x = η^y = 0.1, the performance gap is large. In the right two plots of Figure 3, the Adam variants are compared. In this case, we find that TiAda is not only faster but also more stable compared to Adam with one inner-loop iteration.

Another successful and popular application of minimax optimization is generative adversarial networks, which we study next. In this task, a discriminator (or critic) is trained to distinguish whether an image is from the dataset. At the same time, a generator is trained to synthesize samples from the same distribution as the training dataset so as to fool the discriminator. We use the WGAN-GP loss (Gulrajani et al., 2017), which encourages the discriminator to be a 1-Lipschitz function, with the CIFAR-10 dataset (Krizhevsky et al., 2009) in our experiments.

4.4. GENERATIVE ADVERSARIAL NETWORKS

Since TiAda is a single-loop algorithm, for fair comparison we also update the discriminator only once per generator update in Adam. In Figure 4, we plot the inception scores (Salimans et al., 2016) of TiAda-Adam and Adam under different initial stepsizes. We use the same color for the same initial stepsizes, and different line styles to distinguish the two methods: solid lines for TiAda-Adam and dashed lines for Adam. For all three initial stepsizes we consider, TiAda-Adam achieves higher inception scores. TiAda-Adam is also more robust to the initial stepsize selection, as the gap between the solid lines at the end of training is smaller than that between the dashed lines.
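For reference, the inception score is IS = exp(E_x[KL(p(y|x) ∥ p(y))]), computed from the class probabilities that a pre-trained inception network assigns to synthesized samples. A minimal sketch (the probability matrices below are toy stand-ins for real classifier outputs, not actual network predictions):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from an (n_samples, n_classes) matrix of class
    probabilities p(y|x): IS = exp( mean_x KL(p(y|x) || p(y)) )."""
    probs = np.asarray(probs, float)
    p_y = probs.mean(axis=0, keepdims=True)  # marginal distribution p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# peaked and diverse predictions -> high score; uniform predictions -> score 1
confident = np.eye(10)[np.arange(100) % 10]  # one-hot, all 10 classes covered
uniform = np.full((100, 10), 0.1)
```

A higher score requires each sample's prediction to be confident while the marginal over samples stays diverse, which is exactly what the KL term measures.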

5. CONCLUSION

In this work, we bring adaptive stepsizes to nonconvex minimax problems in a parameter-agnostic manner. We designed the first time-scale adaptive algorithm, TiAda, which progressively adjusts the effective stepsize ratio and reaches the desired time-scale separation. TiAda is also noise adaptive and does not require large batchsizes, unlike the existing parameter-agnostic algorithm for nonconvex minimax optimization. Furthermore, TiAda achieves optimal and near-optimal complexities with deterministic and stochastic gradient oracles, respectively. We also empirically showcased the advantages of TiAda over NeAda and GDA with adaptive stepsizes on several tasks, including simple test functions as well as NC-SC and NC-NC real-world applications. It remains an interesting problem to study whether TiAda can escape stationary points that are not local optima, as adaptive methods do for minimization problems (Staib et al., 2019).

A SUPPLEMENTARY TO EXPERIMENTS

The second-moment accumulators v_t of the algorithms compared here are:

- AdaGrad (TiAda): β_t = 0, v_t = v_0 + Σ_{i=0}^{t} u_i²
- GDA: β_t = 0, v_t = 1
- Adam: 0 < β_t < 1, v_t = γ^{t+1} v_0 + (1 − γ) Σ_{i=0}^{t} γ^{t−i} u_i²
- AMSGrad: 0 < β_t < 1, v_t = max_{m=0,...,t} ( γ^{m+1} v_0 + (1 − γ) Σ_{i=0}^{m} γ^{m−i} u_i² )

A.1 EXPERIMENTAL DETAILS

In this section, we summarize the experimental settings and hyper-parameters used. Since we aim for a parameter-agnostic algorithm that does not require much hyper-parameter tuning, unless otherwise specified we simply use α = 0.6 and β = 0.4 in all experiments. For fair comparisons, we used the same hyper-parameters when comparing TiAda with other algorithms.

Test Functions For Figure 1, we follow the setting of Yang et al. (2022a). We set the batchsize to 128, and for the Adam-like optimizers, including Adam, NeAda-Adam and TiAda-Adam, we use β_1 = 0.9 and β_2 = 0.999 for the first- and second-moment parameters.

Generative Adversarial Networks For this part, we use the code adapted from Green9 (2018).
To produce the results in Figure 4, a four-layer CNN and a four-layer transpose-convolutional CNN are used for the discriminator and the generator, respectively. Following a setting similar to Daskalakis et al. (2018), we set the batchsize to 512, the dimension of the latent variable to 50, and the weight of the gradient penalty term to 10^{-4}. For the Adam-like optimizers, we set β_1 = 0.5 and β_2 = 0.9. To compute the inception score, we feed the pre-trained inception network with 8000 synthesized samples.

A.2 ABLATION STUDY ON CONVERGENCE BEHAVIOR WITH DIFFERENT α AND β

We conduct experiments on the quadratic minimax problem (2) with L = 2 to study the effect of the hyper-parameters α and β on the convergence behavior of TiAda. As discussed in Sections 1 and 3.2, we refer to the period before the stepsize ratio drops below the convergence threshold as Stage I, and the period after that as Stage II. To accentuate the difference between the two stages, we pick a large initial stepsize ratio η^x/η^y = 20. We compare four different pairs of α and β: α ∈ {0.59, 0.6, 0.61, 0.62} and β = 1 − α. From Figure 5, we observe that as soon as TiAda enters Stage II, the norm of the gradient starts to drop. Moreover, the closer α and β are to 0.5, the longer TiAda remains in Stage I, which confirms the intuition behind our analysis in Section 3.2.
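As a self-contained illustration of the two-stage behavior studied here, the sketch below runs TiAda on the toy NC-SC objective f(x, y) = L·x·y − y²/2 (an illustrative stand-in, not the exact quadratic problem (2) of the paper), with the primal stepsize using max{v^x, v^y} as in Algorithm 1:

```python
import numpy as np

def tiada_quadratic(L=2.0, alpha=0.6, beta=0.4, eta_x=0.1, eta_y=0.1, T=20000):
    """TiAda on the toy objective f(x, y) = L*x*y - y**2/2, which is
    1-strongly concave in y, with y*(x) = L*x and Phi(x) = L**2 * x**2 / 2."""
    x, y = 1.0, 0.0
    vx, vy = 1e-2, 1e-2                        # initial accumulators v^x_0, v^y_0 > 0
    g_hist = []
    for _ in range(T):
        gx = L * y                              # df/dx at (x, y)
        gy = L * x - y                          # df/dy at (x, y)
        vx += gx**2
        vy += gy**2
        x -= eta_x / max(vx, vy)**alpha * gx    # primal: max of both accumulators
        y += eta_y / vy**beta * gy              # dual: ascent with its own accumulator
        g_hist.append(np.hypot(gx, gy))
    return x, y, g_hist

x, y, g_hist = tiada_quadratic()
```

Because α > β, the effective stepsize ratio shrinks over time, so the dual variable tracks y*(x) increasingly tightly while the primal variable descends on Φ.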

A.3 ADDITIONAL EXPERIMENTS ON DISTRIBUTIONAL ROBUSTNESS OPTIMIZATION

We use a grid of stepsize combinations to evaluate TiAda and compare it with NeAda and GDA equipped with the corresponding adaptive stepsizes. For AdaGrad-like algorithms, we use {0.1, 0.05, 0.01, 0.0005} for both η^x and η^y, and the results are reported in Figure 6. For Adam-like algorithms, we use {0.001, 0.0005, 0.0001} for η^x and {0.1, 0.05, 0.005, 0.001} for η^y, and the results are shown in Figure 7. We note that since Adam divides by a moving average of squared gradients, it is extremely unstable when the gradients are small; therefore, Adam-like algorithms often experience instability near stationary points.
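The instability remark can be made concrete with the accumulators listed in Appendix A: AdaGrad's cumulative sum makes the effective stepsize η/(v_t)^α non-increasing, while Adam's exponential moving average lets v_t decay once gradients become small, so the effective stepsize can grow back near stationary points. A minimal numeric sketch (the decaying gradient sequence and constants are illustrative):

```python
import numpy as np

def effective_stepsizes(grads, eta=0.1, alpha=0.5, gamma=0.999, v0=1e-8):
    """Compare eta / v_t**alpha for an AdaGrad-style cumulative accumulator
    versus an Adam-style exponential moving average of squared gradients."""
    v_ada, v_adam = v0, v0
    steps_ada, steps_adam = [], []
    for g in grads:
        v_ada += g**2                                   # AdaGrad: v_0 + sum u_i^2
        v_adam = gamma * v_adam + (1 - gamma) * g**2    # Adam: moving average
        steps_ada.append(eta / v_ada**alpha)
        steps_adam.append(eta / v_adam**alpha)
    return steps_ada, steps_adam

# gradients decaying toward a stationary point
grads = [1.0 / (1 + t) for t in range(5000)]
ada, adam = effective_stepsizes(grads)
```

Near the end of the run the Adam-style effective stepsize blows back up, while the AdaGrad-style one keeps shrinking, matching the instability observed in Figure 7.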

B HELPER LEMMAS

Lemma B.1 (Lemma A.2 in Yang et al. (2022a)). Let x_1, ..., x_T be a sequence of non-negative real numbers with x_1 > 0. If 0 < α < 1, then

(Σ_{t=1}^T x_t)^{1-α} ≤ Σ_{t=1}^T x_t / (Σ_{k=1}^t x_k)^α ≤ (1/(1-α)) (Σ_{t=1}^T x_t)^{1-α}.

When α = 1, we have

Σ_{t=1}^T x_t / Σ_{k=1}^t x_k ≤ 1 + log(Σ_{t=1}^T x_t) − log x_1.

Lemma B.2. Under Assumptions 3.1 to 3.3, the primal function Φ(·) is 2κl-smooth and y*(·) is κ-Lipschitz, i.e., ∥y*(x_1) − y*(x_2)∥ ≤ κ∥x_1 − x_2∥.

Lemma B.3 (smoothness of y*(·)). Under Assumptions 3.1 to 3.3, the mapping y*(·) is L-smooth, i.e., ∥∇y*(x_1) − ∇y*(x_2)∥ ≤ L∥x_1 − x_2∥.
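Lemma B.1 is easy to sanity-check numerically; the random sequence below is illustrative:

```python
import numpy as np

def lemma_b1_bounds(x, alpha):
    """Return (lower, middle, upper) for the chain
    (sum x)^(1-a) <= sum_t x_t/(sum_{k<=t} x_k)^a <= (sum x)^(1-a)/(1-a)."""
    s = np.cumsum(x)
    middle = float(np.sum(x / s**alpha))
    total = float(s[-1])
    return total**(1 - alpha), middle, total**(1 - alpha) / (1 - alpha)

rng = np.random.default_rng(0)
x = rng.random(1000) + 0.01            # non-negative sequence with x_1 > 0
lo, mid, hi = lemma_b1_bounds(x, alpha=0.6)

# alpha = 1 case: sum_t x_t / sum_{k<=t} x_k <= 1 + log(sum x) - log x_1
harmonic_sum = float(np.sum(x / np.cumsum(x)))
log_bound = 1 + np.log(x.sum()) - np.log(x[0])
```

These two bounds are what turn the weighted gradient sums in the proofs below into the (v^x_T)^{1-α} and 1 + log v^x_T − log v^x_0 terms.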

C PROOFS

For notational convenience in the proofs, we write ∇_x f(x_t, y_t) and ∇_y f(x_t, y_t) for the (possibly stochastic) gradient estimates. We also denote y*_t = y*(x_t), η_t = η^x / (max{v^x_{t+1}, v^y_{t+1}})^α, γ_t = η^y / (v^y_{t+1})^β, Φ* = min_{x ∈ R^{d_1}} Φ(x), and ∆Φ = Φ_max − Φ*. We use 1_A for the indicator function of the event A. We present a formal version of Theorem 3.1.

Theorem C.1 (deterministic setting). Under Assumptions 3.1 to 3.3, Algorithm 1 with deterministic gradient oracles satisfies that for any 0 < β < α < 1, after T iterations,

Σ_{t=0}^{T-1} ∥∇_x f(x_t, y_t)∥² ≤ max{5C_1, 2C_2},

where

C_1 = v^x_0 + (2∆Φ/η^x)^{1/(1-α)} + (4κl e^{(1-α)(1-log v^x_0)/2} / (e(1-α)(v^x_0)^{2α-1}))^{2/(1-α)} · 1_{2α≥1} + (2κl/(1-2α))^{1/α} · 1_{2α<1} + (c_1 c_5/η^x)^{1/(1-α)} + (2c_1 c_4 η^x e^{(1-α)(1-log v^x_0)/2} / (e(1-α)(v^x_0)^{2α-β-1}))^{2/(1-α)} · 1_{2α-β≥1} + (c_1 c_4 η^x/(1-2α+β))^{1/(α-β)} · 1_{2α-β<1},

C_2 = v^x_0 + [ ( (2∆Φ + c_1 c_5/η^x)/(v^x_0)^{1-2α+β} + c_1 c_4 η^x/(1-2α+β) + 2κl e^{(1-2α+β)(1-log v^x_0)} / (e(1-2α+β)(v^x_0)^{2α-1}) · 1_{2α≥1} + 2κl/((1-2α)(v^x_0)^β) · 1_{2α<1} ) ( c_5/(v^x_0)^{1-2α+β} + c_4(η^x)²/(1-2α+β) )^{α/(1-β)} ]^{1/(1-(1-2α+β)(1+α/(1-β)))} · 1_{2α-β<1} + [ ( (2∆Φ + c_1 c_5/η^x)/(v^x_0)^{1/4} + 8κl e^{(1-log v^x_0)/4} / (e(v^x_0)^{2α-1}) + 4c_1 c_4 η^x e^{(1-log v^x_0)/4} / (e(v^x_0)^{2α-β-1}) ) ( c_5/(v^x_0)^{(1-β)/(4α)} + 4c_4 α(η^x)² e^{(1-β)(1-log v^x_0)/(4α)} / (e(1-β)(v^x_0)^{2α-β-1}) )^{α/(1-β)} ]² · 1_{2α-β≥1},

with ∆Φ = Φ(x_0) − Φ*, c_1 = η^x κ² / (η^y (v^y_{t_0})^{α-β}), c_2 = max{ 4η^y µl/(µ+l), η^y(µ+l) }, c_3 = 4(µ+l)( 1/µ² + (η^y/(v^y_0)^β)² ) c_2^{1/β}, c_4 = (µ+l)( 2κ²/(v^y_0)^α + (µ+l)κ²/(η^y µl) ), and c_5 = c_3 + η^y v^y_0/(v^y_0)^β + η^y c_2^{(1-β)/β}/(1-β).

In addition, denoting the above upper bound for Σ_{t=0}^{T-1} ∥∇_x f(x_t, y_t)∥² as C_3, we have

Σ_{t=0}^{T-1} ∥∇_y f(x_t, y_t)∥² ≤ [ c_5 + c_4(η^x)²( (1 + log C_3 − log v^x_0)/(v^x_0)^{2α-β-1} · 1_{2α-β≥1} + C_3^{1-2α+β}/(1-2α+β) · 1_{2α-β<1} ) ]^{1/(1-β)}.

Proof. We start from the smoothness of the primal function Φ(·).
By Lemma B.2,

Φ(x_{t+1}) ≤ Φ(x_t) − η_t⟨∇Φ(x_t), ∇_x f(x_t, y_t)⟩ + κl η_t² ∥∇_x f(x_t, y_t)∥²
= Φ(x_t) − η_t∥∇_x f(x_t, y_t)∥² + η_t⟨∇_x f(x_t, y_t) − ∇Φ(x_t), ∇_x f(x_t, y_t)⟩ + κl η_t² ∥∇_x f(x_t, y_t)∥²
≤ Φ(x_t) − η_t∥∇_x f(x_t, y_t)∥² + (η_t/2)∥∇_x f(x_t, y_t)∥² + (η_t/2)∥∇_x f(x_t, y_t) − ∇Φ(x_t)∥² + κl η_t² ∥∇_x f(x_t, y_t)∥²
= Φ(x_t) − (η_t/2)∥∇_x f(x_t, y_t)∥² + κl η_t² ∥∇_x f(x_t, y_t)∥² + (η^x / (2(max{v^x_{t+1}, v^y_{t+1}})^α))∥∇_x f(x_t, y_t) − ∇Φ(x_t)∥²
≤ Φ(x_t) − (η_t/2)∥∇_x f(x_t, y_t)∥² + κl η_t² ∥∇_x f(x_t, y_t)∥² + (η^x / (2(v^y_{t_0})^{α-β}(v^y_{t+1})^β))∥∇_x f(x_t, y_t) − ∇Φ(x_t)∥²
≤ Φ(x_t) − (η_t/2)∥∇_x f(x_t, y_t)∥² + κl η_t² ∥∇_x f(x_t, y_t)∥² + (η^x κ² / (2(v^y_{t_0})^{α-β}(v^y_{t+1})^β))∥∇_y f(x_t, y_t)∥²
= Φ(x_t) − (η_t/2)∥∇_x f(x_t, y_t)∥² + κl η_t² ∥∇_x f(x_t, y_t)∥² + (η^x κ² / (2η^y(v^y_{t_0})^{α-β})) · γ_t∥∇_y f(x_t, y_t)∥²,

where in the second-to-last inequality we used the strong concavity of f(x, ·):

∥∇_x f(x_t, y_t) − ∇Φ(x_t)∥ ≤ l∥y_t − y*_t∥ ≤ κ∥∇_y f(x_t, y_t)∥.

Telescoping and rearranging the terms, we have

Σ_{t=0}^{T-1} η_t∥∇_x f(x_t, y_t)∥²
≤ 2(Φ(x_0) − Φ*) + 2κl Σ_{t=0}^{T-1} η_t²∥∇_x f(x_t, y_t)∥² + (η^x κ² / (η^y (v^y_{t_0})^{α-β})) Σ_{t=0}^{T-1} γ_t∥∇_y f(x_t, y_t)∥²
= 2∆Φ + Σ_{t=0}^{T-1} (2κl η^x / (max{v^x_{t+1}, v^y_{t+1}})^{2α})∥∇_x f(x_t, y_t)∥² + c_1 Σ_{t=0}^{T-1} γ_t∥∇_y f(x_t, y_t)∥²
≤ 2∆Φ + Σ_{t=0}^{T-1} (2κl η^x / (v^x_{t+1})^{2α})∥∇_x f(x_t, y_t)∥² + c_1 Σ_{t=0}^{T-1} γ_t∥∇_y f(x_t, y_t)∥²
≤ 2∆Φ + 2κl η^x ( (1 + log v^x_T − log v^x_0)/(v^x_0)^{2α-1} · 1_{2α≥1} + (v^x_T)^{1-2α}/(1-2α) · 1_{2α<1} ) + c_1 Σ_{t=0}^{T-1} γ_t∥∇_y f(x_t, y_t)∥².   (6)

Published as a conference paper at ICLR 2023

We proceed to bound

T -1 t=0 γ t ∥∇ y f (x t , y t )∥ 2 . Let t 0 be the first iteration such that v y t0+1 β > c 2 := max 4η y µl µ+l , η y (µ + l) . We have v y t0 ≤ c 1/β 2 , and for t ≥ t 0 , y t+1 -y * t+1 2 ≤ (1 + λ t )∥y t+1 -y * t ∥ 2 + 1 + 1 λ t y * t+1 -y * t 2 ≤ (1 + λ t ) ∥y t -y * t ∥ 2 + (η y ) 2 v y t+1 2β ∥∇ y f (x t , y t )∥ 2 + 2η y v y t+1 β ⟨y t -y * t , ∇ y f (x t , y t )⟩ (A) + 1 + 1 λ t y * t+1 -y * t 2 , where λ t > 0 will be determined later. For l-smooth and µ-strongly convex function g(x), according to Theorem 2.1.12 in Nesterov (2003) , we have ⟨∇g(x) -∇g(y), x -y⟩ ≥ µl µ + l ∥x -y∥ 2 + 1 µ + l ∥∇g(x) -∇g(y)∥ 2 . Therefore, Term (A) ≤ (1 + λ t ) 1 - 2η y µl (µ + l) v y t+1 β ∥y t -y * t ∥ 2 + (η y ) 2 v y t+1 2β - 2η y (µ + l) v y t+1 β ∥∇ y f (x t , y t )∥ 2 . Let λ t = η y µl (µ+l)(v y t+1 ) β -2η y µl . Note that λ t > 0 after t 0 . Then Term (A) ≤ 1 - η y µl (µ + l) v y t+1 β ∥y t -y * t ∥ 2 + (1 + λ t ) (η y ) 2 v y t+1 2β - 2η y (µ + l) v y t+1 β ∥∇ y f (x t , y t )∥ 2 ≤ ∥y t -y * t ∥ 2 + (1 + λ t ) (η y ) 2 v y t+1 2β - 2η y (µ + l) v y t+1 β (B) ∥∇ y f (x t , y t )∥ 2 . As 1 + λ t ≥ 1 and v y t+1 β ≥ η y (µ + l), we have term (B) ≤ - η y (µ+l)(v y t+1 ) β . Putting them back, we can get y t+1 -y * t+1 2 ≤ ∥y t -y * t ∥ 2 - η y (µ + l) v y t+1 β ∥∇ y f (x t , y t )∥ 2 + 1 + 1 λ t y * t+1 -y * t 2 ≤ ∥y t -y * t ∥ 2 - η y (µ + l) v y t+1 β ∥∇ y f (x t , y t )∥ 2 + (µ + l) v y t+1 β η y µl y * t+1 -y * t 2 ≤ ∥y t -y * t ∥ 2 - η y (µ + l) v y t+1 β ∥∇ y f (x t , y t )∥ 2 + (µ + l)κ 2 v y t+1 β η y µl ∥x x+1 -x t ∥ 2 = ∥y t -y * t ∥ 2 - η y (µ + l) v y t+1 β ∥∇ y f (x t , y t )∥ 2 + (µ + l)κ 2 v y t+1 β η 2 t η y µl ∥∇ x f (x t , y t )∥ 2 . Then, by telescoping, we have T -1 t=t0 η y (µ + l) v y t+1 β ∥∇ y f (x t , y t )∥ 2 ≤ y t0 -y * t0 2 + T -1 t=t0 (µ + l)κ 2 v y t+1 β η 2 t η y µl ∥∇ x f (x t , y t )∥ 2 . 
( ) For the first term in the RHS, using Young's inequality with τ to be determined later, we have y t0 -y * t0 2 ≤ 2 y t0 -y * t0-1 2 + 2 y * t0 -y * t0-1 2 = 2 P Y (y t0-1 + γ t0-1 ∇ y f (x t0-1 , y t0-1 )) -y * t0-1 2 + 2 y * t0 -y * t0-1 2 ≤ 2 y t0-1 + γ t0-1 ∇ y f (x t0-1 , y t0-1 ) -y * t0-1 2 + 2 y * t0 -y * t0-1 2 ≤ 4 y t0-1 -y * t0-1 2 + γ 2 t0-1 ∥∇ y f (x t0-1 , y t0-1 )∥ 2 + 2 y * t0 -y * t0-1 2 ≤ 4 1 µ 2 ∥∇ y f (x t0-1 , y t0-1 )∥ 2 + γ 2 t0-1 ∥∇ y f (x t0-1 , y t0-1 )∥ 2 + 2 y * t0 -y * t0-1 2 = 4 1 µ 2 + γ 2 t0-1 ∥∇ y f (x t0-1 , y t0-1 )∥ 2 + 2 y * t0 -y * t0-1 2 ≤ 4 1 µ 2 + γ 2 0 v y t0 + 2 y * t0 -y * t0-1 2 ≤ 4 1 µ 2 + η y v y t0 β c 1/β 2 + 2 y * t0 -y * t0-1 2 ≤ 4 1 µ 2 + η y v y t0 β c 1/β 2 + 2κ 2 ∥x t0 -x t0-1 ∥ 2 ≤ 4 1 µ 2 + η y v y t0 β c 1/β 2 + 2κ 2 η 2 t0-1 ∥∇ x f (x t0-1 , y t0-1 )∥ 2 ≤ 4 1 µ 2 + η y v y t0 β c 1/β 2 + 2κ 2 v y t+1 β (v y 0 ) β η 2 t0-1 ∥∇ x f (x t0-1 , y t0-1 )∥ 2 . Combined with Equation ( 7), we have T -1 t=t0 η y v y t+1 β ∥∇ y f (x t , y t )∥ 2 ≤ 4(µ + l) 1 µ 2 + η y v y t0 β c 1/β 2 c3 + (µ + l) 2κ 2 (v y 0 ) α + (µ + l)κ 2 η y µl c4 T -1 t=t0-1 v y t+1 β η 2 t ∥∇ x f (x t , y t )∥ 2 . 
By adding terms from 0 to t 0 -1 and η y v y 0 (v y 0 ) β from both sides, we have η y v y 0 (v y 0 ) β + T -1 t=0 η y v y t+1 β ∥∇ y f (x t , y t )∥ 2 ≤ c 3 + η y v y 0 (v y 0 ) β + c 4 T -1 t=0 v y t+1 β η 2 t ∥∇ x f (x t , y t )∥ 2 + t0-1 t=t=0 η y v y t+1 β ∥∇ y f (x t , y t )∥ 2 ≤ c 3 + η y v y 0 (v y 0 ) β + c 4 T -1 t=0 v y t+1 β η 2 t ∥∇ x f (x t , y t )∥ 2 + η y v y 0 (v y 0 ) β + t0-1 t=t=0 η y v y t+1 β ∥∇ y f (x t , y t )∥ 2 ≤ c 3 + η y v y 0 (v y 0 ) β + c 4 T -1 t=0 v y t+1 β η 2 t ∥∇ x f (x t , y t )∥ 2 + η y 1 -β v 1-β t0 ≤ c 3 + η y v y 0 (v y 0 ) β + c 4 T -1 t=0 v y t+1 β η 2 t ∥∇ x f (x t , y t )∥ 2 + η y c 1-β β 2 1 -β = c 3 + η y v y 0 (v y 0 ) β + η y c 1-β β 2 1 -β + c 4 (η x ) 2 T -1 t=0 v y t+1 β max v x t+1 , v y t+1 2α ∥∇ x f (x t , y t )∥ 2 = c 3 + η y v y 0 (v y 0 ) β + η y c 1-β β 2 1 -β c5 +c 4 (η x ) 2 T -1 t=0 1 v x t+1 2α-β ∥∇ x f (x t , y t )∥ 2 ≤ c 5 + c 4 (η x ) 2 1 + log v x T -log v x 0 (v x 0 ) 2α-β-1 • 1 2α-β≥1 + (v x T ) 1-2α+β 1 -2α + β • 1 2α-β<1 . The LHS can be bounded by (v y T ) 1-β by Lemma B.1. Then we get two useful inequalities from above:        T -1 t=0 γ t ∥∇ y f (x t , y t )∥ 2 ≤ c 5 + c 4 (η x ) 2 1+log v x T -log v x 0 (v x 0 ) 2α-β-1 • 1 2α-β≥1 + (v x T ) 1-2α+β 1-2α+β • 1 2α-β<1 v y T ≤ c 5 + c 4 (η x ) 2 1+log v x T -log v x 0 (v x 0 ) 2α-β-1 • 1 2α-β≥1 + (v x T ) 1-2α+β 1-2α+β • 1 2α-β<1 1 1-β . (8) Now bring it back to Equation ( 6), we get T -1 t=0 η t ∥∇ x f (x t , y t )∥ 2 ≤ 2∆Φ + 2κlη x 1 + log v x T -log v x 0 (v x 0 ) 2α-1 • 1 2α≥1 + (v x T ) 1-2α 1 -2α • 1 2α<1 + c 1 c 5 + c 1 c 4 (η x ) 2 1 + log v x T -log v x 0 (v x 0 ) 2α-β-1 • 1 2α-β≥1 + (v x T ) 1-2α+β 1 -2α + β • 1 2α-β<1 . 
For the LHS, we have T -1 t=0 η t ∥∇ x f (x t , y t )∥ 2 = T -1 t=0 η x max v x t+1 , v y t+1 α ∥∇ x f (x t , y t )∥ 2 ≥ η x max {v x T , v y T } α T -1 t=0 ∥∇ x f (x t , y t )∥ 2 From here, by combining two inequalites above and noting that T -1 t=0 ∥∇ x f (x t , y t )∥ 2 ≤ v x T , we can already conclude that T -1 t=0 ∥∇ x f (x t , y t )∥ 2 = O(1) . Now we will provide an explicit bound. We consider two cases: (1) If v y T ≤ v x T , then T -1 t=0 ∥∇ x f (x t , y t )∥ 2 ≤ 2∆Φ (v x T ) α η x + 2κl (v x T ) α (1 + log v x T -log v x 0 ) (v x 0 ) 2α-1 • 1 2α≥1 + (v x T ) 1-α 1 -2α • 1 2α<1 + c 1 c 5 (v x T ) α η x + c 1 c 4 η x (v x T ) α (1 + log v x T -log v x 0 ) (v x 0 ) 2α-β-1 • 1 2α-β≥1 + (v x T ) 1-α+β 1 -2α + β • 1 2α-β<1 = 2∆Φ (v x T ) α η x + 2κl (v x T ) α (v x T ) 1-α 2 (v x T ) α-1 2 (1 + log v x T -log v x 0 ) (v x 0 ) 2α-1 • 1 2α≥1 + (v x T ) 1-α 1 -2α • 1 2α<1 + c 1 c 5 (v x T ) α η x + c 1 c 4 η x (v x T ) α (v x T ) 1-α 2 (v x T ) α-1 2 (1 + log v x T -log v x 0 ) (v x 0 ) 2α-β-1 • 1 2α-β≥1 + (v x T ) 1-α+β 1 -2α + β • 1 2α-β<1 ≤ 2∆Φ (v x T ) α η x + 2κl 2e (1-α)(1-log v x 0 )/2 (v x T ) 1+α 2 e(1 -α) (v x 0 ) 2α-1 • 1 2α≥1 + (v x T ) 1-α 1 -2α • 1 2α<1 + c 1 c 5 (v x T ) α η x + c 1 c 4 η x 2e (1-α)(1-log v x 0 )/2 (v x T ) 1+α 2 e(1 -α) (v x 0 ) 2α-β-1 • 1 2α-β≥1 + (v x T ) 1-α+β 1 -2α + β • 1 2α-β<1 , where we used x -m (c + log x) ≤ e cm em for x > 0, m > 0 and c ∈ R in the last inequality. Also, if 0 < α i < 1 and b i are positive constants, and x ≤ n i=1 b i x αi , then we get x ≤ n n i=1 b 1/(1-αi) i . Now consider v x T as the x in the previous statement, and note that the LHS of Equation ( 9) equals to v x T -v x 0 . Then we can get v x T ≤ 5v x 0 + 5 2∆Φ η x 1 1-α + 5 4κle (1-α)(1-log v x 0 )/2 e(1 -α) (v x 0 ) 2α-1 2 1-α • 1 2α≥1 + 5 2κl 1 -2α 1 α • 1 2α<1 + 5 c 1 c 5 η x 1 1-α + 5 2c 1 c 4 η x e (1-α)(1-log v x 0 )/2 e(1 -α) (v x 0 ) 2α-β-1 2 1-α • 1 2α-β≥1 + 5 c 1 c 4 η x 1 -2α + β 1 α-β • 1 2α-β<1 . 
Note that the RHS is a constant and also an upper bound for T -1 t=0 ∥∇ x f (x t , y t )∥ 2 . (2) If v y T ≤ v x T , then we can use the upper bound for v y T from Equation (8). We now discuss two cases: 1. 2α < 1 + β. Then we have T -1 t=0 ∥∇ x f (x t , y t )∥ 2 ≤ 2∆Φ + c 1 c 5 η x + 2κl 1 + log v x T -log v x 0 (v x 0 ) 2α-1 • 1 2α≥1 + (v x T ) 1-2α 1 -2α • 1 2α<1 + c 1 c 4 η x (v x T ) 1-2α+β 1 -2α + β c 5 + c 4 (η x ) 2 (v x T ) 1-2α+β 1 -2α + β α 1-β ≤ 2∆Φ + c 1 c 5 η x (v x 0 ) 1-2α+β + 2κl 1 + log v x T -log v x 0 (v x 0 ) 2α-1 (v x T ) 1-2α+β • 1 2α≥1 + 1 (1 -2α) (v x 0 ) β • 1 2α<1 + c 1 c 4 η x 1 -2α + β c 5 (v x 0 ) 1-2α+β + c 4 (η x ) 2 1 -2α + β α 1-β • (v x T ) 1-2α+β+ (1-2α+β)α 1-β ≤ 2∆Φ + c 1 c 5 η x (v x 0 ) 1-2α+β + 2κl e (1-2α+β)(1-log v x 0 ) e(1 -2α + β) (v x 0 ) 2α-1 • 1 2α≥1 + 1 (1 -2α) (v x 0 ) β • 1 2α<1 + c 1 c 4 η x 1 -2α + β c 5 (v x 0 ) 1-2α+β + c 4 (η x ) 2 1 -2α + β α 1-β • (v x T ) 1-2α+β+ (1-2α+β)α 1-β , Note that since α > β, we have 1 -2α + β + (1 -2α + β)α 1 -β ≤ (1 -α)α 1 -β + 1 -α = 1 + α(β -α) 1 -β < 1. Therefore, with the same reasoning as Equation ( 10), T -1 t=0 ∥∇ x f (x t , y t )∥ 2 ≤ v x T ≤ 2 2∆Φ + c 1 c 5 η x (v x 0 ) 1-2α+β + c 1 c 4 η x 1 -2α + β + 2κle (1-2α+β)(1-log v x 0 ) e(1 -2α + β) (v x 0 ) 2α-1 • 1 2α≥1 + 2κl (1 -2α) (v x 0 ) β • 1 2α<1 c 5 (v x 0 ) 1-2α+β + c 4 (η x ) 2 1 -2α + β α 1-β 1 1-(1-2α+β) ( 1+ α 1-β ) + 2v x 0 , which gives us constant RHS. 2. 2α ≥ 1 + β. 
Then we have T -1 t=0 ∥∇ x f (x t , y t )∥ 2 ≤ 2∆Φ + c 1 c 5 η x + 2κl (1 + log v x T -log v x 0 ) (v x 0 ) 2α-1 + c 1 c 4 η x (1 + log v x T -log v x 0 ) (v x 0 ) 2α-β-1 c 5 + c 4 (η x ) 2 (1 + log v x T -log v x 0 ) (v x 0 ) 2α-β-1 α 1-β ≤ 2∆Φ + c 1 c 5 η x (v x 0 ) 1/4 + 2κl (1 + log v x T -log v x 0 ) (v x 0 ) 2α-1 (v x T ) 1/4 + c 1 c 4 η x (1 + log v x T -log v x 0 ) (v x 0 ) 2α-β-1 (v x T ) 1/4 c 5 (v x 0 ) (1-β) 4α + c 4 (η x ) 2 (1 + log v x T -log v x 0 ) (v x 0 ) 2α-β-1 (v x T ) (1-β) 4α α 1-β • (v x T ) 1/2 ≤ 2∆Φ + c 1 c 5 η x (v x 0 ) 1/4 + 8κle (1-log v x 0 )/4 e (v x 0 ) 2α-1 + 4c 1 c 4 η x e (1-log v x 0 )/4 e (v x 0 ) 2α-β-1 c 5 (v x 0 ) (1-β) 4α + 4c 4 α (η x ) 2 e (1-β)(1-log v x 0 )/(4α) e(1 -β) (v x 0 ) 2α-β-1 α 1-β • (v x T ) 1/2 , which implies T -1 t=0 ∥∇ x f (x t , y t )∥ 2 ≤ v x T ≤ 2 2∆Φ + c 1 c 5 η x (v x 0 ) 1/4 + 8κle (1-log v x 0 )/4 e (v x 0 ) 2α-1 + 4c 1 c 4 η x e (1-log v x 0 )/4 e (v x 0 ) 2α-β-1 c 5 (v x 0 ) (1-β) 4α + 4c 4 α (η x ) 2 e (1-β)(1-log v x 0 )/(4α) e(1 -β) (v x 0 ) 2α-β-1 α 1-β 2 + 2v x 0 . Now we also get only a constant on the RHS. Summarizing all the cases, we finish the proof.
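Three elementary facts used repeatedly in the proof above can be verified numerically: the bound x^{−m}(c + log x) ≤ e^{cm}/(em) for x > 0 and m > 0, the implication from x ≤ Σ_i b_i x^{α_i} (with 0 < α_i < 1) to x ≤ n Σ_i b_i^{1/(1−α_i)}, and the strong-convexity/smoothness coupling inequality quoted from Theorem 2.1.12 in Nesterov (2003). The constants below are illustrative:

```python
import numpy as np

# (i) x**(-m) * (c + log x) <= e**(c*m) / (e*m) for all x > 0, m > 0
ok_i = True
for m, c in [(0.25, 0.0), (1.0, 1.3), (2.0, -0.7)]:
    xs = np.logspace(-6, 6, 120001)
    ok_i &= bool(np.all(xs**(-m) * (c + np.log(xs)) <= np.exp(c*m)/(np.e*m) + 1e-9))

# (ii) if x <= sum_i b_i x**alpha_i with 0 < alpha_i < 1, then
#      x <= n * sum_i b_i**(1/(1-alpha_i)); the largest such x is the fixed
#      point of the concave increasing map F(x) = sum_i b_i x**alpha_i
b = np.array([2.0, 0.5, 3.0])
al = np.array([0.3, 0.6, 0.9])
x = 1.0
for _ in range(20000):
    x = float(np.sum(b * x**al))        # monotone convergence to the fixed point
ok_ii = x <= len(b) * np.sum(b**(1.0 / (1.0 - al)))

# (iii) for a mu-strongly convex, l-smooth g (here g(u) = 0.5 u^T A u):
# <grad g(u) - grad g(v), u - v> >= mu*l/(mu+l)*||u-v||^2 + ||grad diff||^2/(mu+l)
rng = np.random.default_rng(0)
d, mu, l = 5, 0.5, 4.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(rng.uniform(mu, l, d)) @ Q.T   # eigenvalues in [mu, l]
ok_iii = True
for _ in range(200):
    u, v = rng.standard_normal(d), rng.standard_normal(d)
    lhs = (A @ (u - v)) @ (u - v)
    rhs = mu*l/(mu+l)*np.sum((u - v)**2) + np.sum((A @ (u - v))**2)/(mu+l)
    ok_iii &= bool(lhs >= rhs - 1e-9)
```

Fact (i) yields the e^{(...)(1−log v^x_0)} terms, fact (ii) is what turns the implicit bounds on v^x_T into the explicit constants C_1 and C_2, and fact (iii) drives the contraction of ∥y_t − y*_t∥.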

C.2 INTERMEDIATE LEMMAS FOR THEOREM 3.2

Lemma C.1. Under the same setting as Theorem 3.2, if for t = t 0 to t 1 -1 and any λ t > 0, S t , y t+1 -y * t+1 2 ≤ (1 + λ t ) ∥y t+1 -y * t ∥ 2 + S t , then we have E t1-1 t=t0 (f (x t , y * t ) -f (x t , y t )) ≤ E t1-1 t=t0+1 1 -γ t µ 2γ t ∥y t -y * t ∥ 2 - 1 2γ t (1 + λ t ) y t+1 -y * t+1 2 + E t1-1 t=t0 γ t 2 ∇ y f (x t , y t ) 2 + E t1-1 t=t0 S t 2γ t (1 + λ t ) . Proof. Letting λ t := µη y 2(v y t+1 ) β , we have y t+1 -y * t+1 2 ≤ (1 + λ t )∥y t+1 -y * t ∥ 2 + S t = (1 + λ t ) P Y y t + γ t ∇ y f (x t , y t ) -y * t 2 + S t ≤ (1 + λ t ) y t + γ t ∇ y f (x t , y t ) -y * t 2 + S t = (1 + λ t ) ∥y t -y * t ∥ 2 + γ 2 t ∇ y f (x t , y t ) 2 + 2γ t ∇ y f (x t , y t ), y t -y * t + S t = (1 + λ t ) ∥y t -y * t ∥ 2 + γ 2 t ∇ y f (x t , y t ) 2 + 2γ t ∇ y f (x t , y t ), y t -y * t + γ t µ∥y t -y * t ∥ 2 -γ t µ∥y t -y * t ∥ 2 + S t

By multiplying

1 γt(1+λt) and rearranging the terms, we can get 2 ∇ y f (x t , y t ), y * t -y t -µ∥y t -y * t ∥ 2 ≤ 1 -γ t µ γ t ∥y t -y * t ∥ 2 - 1 γ t (1 + λ t ) y t+1 -y * t+1 2 + γ t ∇ y f (x t , y t ) 2 + S t γ t (1 + λ t ) . By telescoping from t = t 0 to t 1 -1, we have t1-1 t=t0 ∇ y f (x t , y t ), y * t -y t - µ 2 ∥y t -y * t ∥ 2 ≤ t1-1 t=t0+1 1 -γ t µ 2γ t ∥y t -y * t ∥ 2 - 1 2γ t (1 + λ t ) y t+1 -y * t+1 2 + t1-1 t=t0 γ t 2 ∇ y f (x t , y t ) 2 + t1-1 t=t0 S t 2γ t (1 + λ t ) . Now we take the expectation and get E [LHS] ≥ E t1-1 t=t0 E ξ y t ∇ y f (x t , y t ), y * t -y t - µ 2 ∥y t -y * t ∥ 2 = E t1-1 t=t0 ⟨∇ y f (x t , y t ), y * t -y t ⟩ - µ 2 ∥y t -y * t ∥ 2 ≥ E t1-1 t=t0 (f (x t , y * t ) -f (x t , y t )) , where we used strong-concavity in the last inequality. Lemma C.2. Under the same setting as Theorem 3.2, if v y t+1 ≤ C for t = 0, ..., t 0 -1, then we have E t0-1 t=0 (f (x t , y * t ) -f (x t , y t )) ≤ E t0-1 t=0 1 -γ t µ 2γ t ∥y t -y * t ∥ 2 - 1 γ t (2 + µγ t ) y t+1 -y * t+1 2 + E t0-1 t=0 γ t 2 ∇ y f (x t , y t ) 2 + κ 2 µη y C β + 2C 2β (η x ) 2 2µ (η y ) 2 E 1 + log v x t0 -log v x 0 (v x 0 ) 2α-1 • 1 α≥0.5 + v x t0 1-2α 1 -2α • 1 α<0.5 . Proof. By Young's inequality, we have y t+1 -y * t+1 2 ≤ (1 + λ t )∥y t+1 -y * t ∥ 2 + 1 + 1 λ t y * t+1 -y * t 2 . Then letting λ t = µγt 2 and by Lemma C.1, we have E t0-1 t=0 (f (x t , y * t ) -f (x t , y t )) ≤ E t0-1 t=0 1 -γ t µ 2γ t ∥y t -y * t ∥ 2 - 1 γ t (2 + µγ t ) y t+1 -y * t+1 2 + E t0-1 t=0 γ t 2 ∇ y f (x t , y t ) 2 + E   t0-1 t=0 1 + 2 µγt γ t (2 + µγ t ) y * t+1 -y * t 2   . We now remain to bound the last term: E   t0-1 t=0 1 + 2 µγt γ t (2 + µγ t ) y * t+1 -y * t 2   ≤ E   t0-1 t=0 1 + 2 µγt 2γ t y * t+1 -y * t 2   = E t0-1 t=0 µη y v y t+1 β + 2 v y t+1 2β 2µ (η y ) 2 y * t+1 -y * t 2 ≤ µη y C β + 2C 2β 2µ (η y ) 2 E t0-1 t=0 y * t+1 -y * t 2 . 
By Lemma B.2 we have t0-1 t=0 y * t+1 -y * t 2 ≤ κ 2 t0-1 t=0 ∥x t+1 -x t ∥ 2 = κ 2 t0-1 t=0 η 2 t ∇ x f (x t , y t ) 2 = κ 2 (η x ) 2 t0-1 t=0 1 max v x t+1 , v y t+1 2α ∇ x f (x t , y t ) 2 ≤ κ 2 (η x ) 2 t0-1 t=0 1 v x t+1 2α ∇ x f (x t , y t ) 2 ≤ κ 2 (η x ) 2 v x 0 (v x 0 ) 2α + t0-1 t=0 1 v x t+1 2α ∇ x f (x t , y t ) 2 ≤ κ 2 (η x ) 2 1 + log v x t0 -log v x 0 (v x 0 ) 2α-1 • 1 α≥0.5 + v x t0 1-2α 1 -2α • 1 α<0.5 where we applied Lemma B.1 in the last inequality. Bringing back this result, we finish the proof. Lemma C.3. Under the same setting as Theorem 3.2, if t 0 is the first iteration such that v y t0+1 > C, then we have E T -1 t=t0 (f (x t , y * t ) -f (x t , y t )) ≤ E T -1 t=t0 1 -γ t µ 2γ t ∥y t -y * t ∥ 2 - 1 γ t (2 + µγ t ) y t+1 -y * t+1 2 + E T -1 t=t0 γ t 2 ∇ y f (x t , y t ) 2 + κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β (η x ) 2 2(1 -α)η y (v y 0 ) α-β E (v x T ) 1-α + 2κ 2 (η x ) 2 µ (η y ) 2 C 2α-2β E T -1 t=t0 ∥∇ x f (x t , y t )∥ 2 + 1 µ + η y (v y 0 ) β 4κη x G 2 η y (v y 0 ) α E (v y T ) β . Proof. By the Lipschitzness of y * (•) as in Lemma B.2, we have y t+1 -y * t+1 2 = ∥y t+1 -y * t ∥ 2 + y * t -y * t+1 2 + 2 y t+1 -y * t , y * t -y * t+1 ≤ ∥y t+1 -y * t ∥ 2 + κ 2 η 2 t ∇ x f (x t , y t ) 2 + 2 y t+1 -y * t , y * t -y * t+1 ≤ ∥y t+1 -y * t ∥ 2 + κ 2 η 2 t ∇ x f (x t , y t ) 2 -2 (y t+1 -y * t ) ⊺ ∇y * (x t ) (x t+1 -x t ) (C) + 2 (y t+1 -y * t ) ⊺ y * t -y * t+1 + ∇y * (x t ) (x t+1 -x t ) (D) . 
For Term (C), by the Cauchy-Schwarz and Lipschitzness of y * (•), -2 (y t+1 -y * t ) ⊺ ∇y * (x t ) (x t+1 -x t ) = 2η t (y t+1 -y * t ) ⊺ ∇y * (x t )∇ x f (x t , y t ) + 2η t (y t+1 -y * t ) ⊺ ∇y * (x t ) ∇ x f (x t , y t ) -∇ x f (x t , y t ) ≤ 2η t ∥y t+1 -y * t ∥∥∇y * (x t )∥∥∇ x f (x t , y t )∥ + 2η t (y t+1 -y * t ) ⊺ ∇y * (x t ) ∇ x f (x t , y t ) -∇ x f (x t , y t ) ≤ 2∥y t+1 -y * t ∥κη t ∥∇ x f (x t , y t )∥ + 2η t (y t+1 -y * t ) ⊺ ∇y * (x t ) ∇ x f (x t , y t ) -∇ x f (x t , y t ) ≤ λ t ∥y t+1 -y * t ∥ 2 + κ 2 η 2 t λ t ∥∇ x f (x t , y t )∥ 2 + 2η t (y t+1 -y * t ) ⊺ ∇y * (x t ) ∇ x f (x t , y t ) -∇ x f (x t , y t ) , where we used Young's inequality in the last step and λ t > 0 will be determined later. For Term (D), according to Cauchy-Schwarz and the smoothness of y * (•) as shown in Lemma B.3, 2 (y t+1 -y * t ) ⊺ y * t -y * t+1 + ∇y * (x t ) (x t+1 -x t ) ≤ 2∥y t+1 -y * t ∥ y * t -y * t+1 + ∇y * (x t ) (x t+1 -x t ) ≤ 2∥y t+1 -y * t ∥ • L 2 ∥x t+1 -x t ∥ 2 = Lη 2 t ∥y t+1 -y * t ∥ ∇ x f (x t , y t ) 2 ≤ Lη 2 t ∥y t+1 -y * t ∥G • ∇ x f (x t , y t ) ≤ τ LG 2 η 2 t 2 ∥y t+1 -y * t ∥ 2 + Lη 2 t 2τ ∇ x f (x t , y t ) 2 , where in the last step we used Young's inequality and τ > 0. Therefore, in total, we have y t+1 -y * t+1 2 ≤ 1 + λ t + τ LG 2 η 2 t 2 ∥y t+1 -y * t ∥ 2 + κ 2 + L 2τ η 2 t ∇ x f (x t , y t ) 2 + κ 2 η 2 t λ t ∥∇ x f (x t , y t )∥ 2 + 2η t (y t+1 -y * t ) ⊺ ∇y * (x t ) ∇ x f (x t , y t ) -∇ x f (x t , y t ) . Note that we can upper bound η t by η t = η x max v x t+1 , v y t+1 α ≤ η x v y t+1 α ≤ η x (v y 0 ) α , η t ≤ η x v y t+1 α = η x v y t+1 α-β v y t+1 β ≤ η x (v y 0 ) α-β v y t+1 β , which, plugged into the previous result, implies y t+1 -y * t+1 2 ≤ 1 + λ t + τ LG 2 (η x ) 2 2 (v y 0 ) 2α-β v y t+1 β ∥y t+1 -y * t ∥ 2 + κ 2 + L 2τ η 2 t ∇ x f (x t , y t ) 2 + κ 2 η 2 t λ t ∥∇ x f (x t , y t )∥ 2 + 2η t (y t+1 -y * t ) ⊺ ∇y * (x t ) ∇ x f (x t , y t ) -∇ x f (x t , y t ) . 
Now we choose λ t = µη y 4(v y t+1 ) β and τ = µη y (v y 0 ) 2α-β LG 2 (η x ) 2 , and get y t+1 -y * t+1 2 ≤ 1 + µη y 2 v y t+1 β ∥y t+1 -y * t ∥ 2 + κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β η 2 t ∇ x f (x t , y t ) 2 + 4κ 2 v y t+1 β η 2 t µη y ∥∇ x f (x t , y t )∥ 2 + 2η t (y t+1 -y * t ) ⊺ ∇y * (x t ) ∇ x f (x t , y t ) -∇ x f (x t , y t ) . Then Lemma C.1 gives us E T -1 t=t0 (f (x t , y * t ) -f (x t , y t )) ≤ E T -1 t=t0 1 -γ t µ 2γ t ∥y t -y * t ∥ 2 - 1 γ t (2 + µγ t ) y t+1 -y * t+1 2 + E T -1 t=t0 γ t 2 ∇ y f (x t , y t ) 2 + E T -1 t=t0 1 γ t (2 + µγ t ) κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β η 2 t ∇ x f (x t , y t ) 2 (E) + E T -1 t=t0 4κ 2 v y t+1 β η 2 t γ t (2 + µγ t )µη y ∥∇ x f (x t , y t )∥ 2 (F) + E T -1 t=t0 2η t γ t (2 + µγ t ) (y t+1 -y * t ) ⊺ ∇y * (x t ) ∇ x f (x t , y t ) -∇ x f (x t , y t ) (G) Now we proceed to bound each term. Term (E) Term (E) ≤ κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β E T -1 t=t0 η 2 t 2γ t ∇ x f (x t , y t ) 2 = κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β E T -1 t=t0 (η x ) 2 v y t+1 β 2η y max v x t+1 , v y t+1 2α ∇ x f (x t , y t ) 2 ≤ κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β E T -1 t=t0 (η x ) 2 v y t+1 β 2η y v y t+1 β v y t+1 α-β v x t+1 α ∇ x f (x t , y t ) 2 ≤ κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β E T -1 t=t0 (η x ) 2 2η y (v y 0 ) α-β v x t+1 α ∇ x f (x t , y t ) 2 ≤ κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β E (η x ) 2 2η y (v y 0 ) α-β v x 0 (v x 0 ) α + T -1 t=0 1 v x t+1 α ∇ x f (x t , y t ) 2 ≤ κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β (η x ) 2 2(1 -α)η y (v y 0 ) α-β E (v x T ) 1-α , where we used Lemma B.1 in the last step. 
Term (F) Term (F) ≤ E T -1 t=t0 2κ 2 v y t+1 β η 2 t γ t µη y ∥∇ x f (x t , y t )∥ 2 = 2κ 2 (η x ) 2 µ (η y ) 2 E T -1 t=t0 v y t+1 2β max v x t+1 , v y t+1 2α ∥∇ x f (x t , y t )∥ 2 ≤ 2κ 2 (η x ) 2 µ (η y ) 2 E T -1 t=t0 v y t+1 2β v y t+1 2α ∥∇ x f (x t , y t )∥ 2 ≤ 2κ 2 (η x ) 2 µ (η y ) 2 E 1 v y t0+1 2α-2β T -1 t=t0 ∥∇ x f (x t , y t )∥ 2 ≤ 2κ 2 (η x ) 2 µ (η y ) 2 C 2α-2β E T -1 t=t0 ∥∇ x f (x t , y t )∥ 2 Term (G) For simplicity, denote m t := 2 γt(2+µγt) (y t+1 -y * t ) ⊺ ∇y * (x t ) ∇ x f (x t , y t ) -∇ x f (x t , y t ) Since y * (•) is κ-Lipschitz as in Lemma B.2, |m t | can be upper bounded as |m t | ≤ 1 γ t ∥y t+1 -y * t ∥∥∇y * (x t )∥ ∇ x f (x t , y t ) + ∥∇ x f (x t , y t )∥ ≤ κ γ t ∥y t+1 -y * t ∥ ∇ x f (x t , y t ) + ∥∇ x f (x t , y t )∥ ≤ κ γ t P Y y t + γ t ∇ y f (x t , y t ) -y * t ∇ x f (x t , y t ) + ∥∇ x f (x t , y t )∥ ≤ κ γ t y t + γ t ∇ y f (x t , y t ) -y * t ∇ x f (x t , y t ) + ∥∇ x f (x t , y t )∥ ≤ κ γ t ∥y t -y * t ∥ + γ t ∇ y f (x t , y t ) ∇ x f (x t , y t ) + ∥∇ x f (x t , y t )∥ ≤ κ γ t 1 µ ∥∇ y f (x t , y t )∥ + γ t ∇ y f (x t , y t ) ∇ x f (x t , y t ) + ∥∇ x f (x t , y t )∥ ≤ 2Gκ γ T -1 G µ + η y G (v y 0 ) β M . Also note that γ t and y t+1 does not depend on ξ x t , so E ξ x t [m t ] = 0. Next, we look at Term (G). Term (G) = E T -1 t=t0 η t m t = E η t0 m t0 + T -1 t=t0+1 η t-1 m t + T -1 t=t0+1 (η t -η t-1 ) m t ≤ E η x (v y 0 ) α M + T -1 t=t0+1 η t-1 E ξ x t [m t ] + T -1 t=t0+1 (η t-1 -η t ) (-m t ) ≤ E η x (v y 0 ) α M + T -1 t=t0+1 (η t-1 -η t ) M ≤ E 2η x (v y 0 ) α M = 1 µ + η y (v y 0 ) β 4κη x G 2 η y (v y 0 ) α E (v y T ) β . Summarizing all the results, we finish the proof. Lemma C.4. Under the same setting as Theorem 3.2, we have E T -1 t=0 1 -γ t µ 2γ t ∥y t -y * t ∥ 2 - 1 γ t (2 + µγ t ) y t+1 -y * t+1 2 ≤ (v y 0 ) β G 2 2µ 2 η y + (2βG) 1 1-β +2 G 2 4µ 1 1-β +3 (η y ) 1 1-β +2 (v y 0 ) 2-2β . Proof. 
E T -1 t=0 1 -γ t µ 2γ t ∥y t -y * t ∥ 2 - 1 γ t (2 + µγ t ) y t+1 -y * t+1 2 ≤ (v y 0 ) β 2η y - µ 2 ∥y 0 -y * 0 ∥ 2 + 1 2η y T -1 t=1 v y t+1 β - µη y 2 -(v y t ) β - µ 2 (η y ) 2 4 (v y t ) β + 2µη y ∥y t -y * t ∥ ≤ (v y 0 ) β G 2 2µ 2 η y + 1 2η y T -1 t=1 v y t+1 β - µη y 2 -(v y t ) β ∥y t -y * t ∥ 2 (H) . For Term (H),we will bound it using the same strategy as in (Yang et al., 2022a) . The general idea is to show that v y t+1 βµη y 2 -(v y t ) β is positive for only a constant number of times. If the term is positive at iteration t, then we have 0 < v y t+1 β -(v y t ) β - µη y 2 = v y t + ∇ y f (x t , y t ) 2 β -(v y t ) β - µη y 2 = (v y t ) β   1 + ∇ y f (x t , y t ) 2 v y t    β -(v y t ) β - µη y 2 ≤ (v y t ) β   1 + β ∇ y f (x t , y t ) 2 v y t    -(v y t ) β - µη y 2 = β ∇ y f (x t , y t ) 2 (v y t ) 1-β - µη y 2 , where in the last inequality we used Bernoulli's inequality. By rearranging the terms, we have the two following conditions      ∇ y f (x t , y t ) 2 > µη y 2β (v y t ) 1-β ≥ µη y 2β (v y 0 ) 1-β (v y t ) 1-β < 2β µη y ∇ y f (x t , y t ) 2 ≤ 2βG µη y , This indicates that at each time the term is positive, the gradient norm must be large enough and the accumulated gradient norm, i.e., v y t+1 , must be small enough. Therefore, we can have at most 2βG µη y 1 1-β µη y 2β (v y 0 ) 1-β constant number of iterations when the term is positive. When the term is positive, it is also upper bounded by using the result from Equation (11): v y t+1 β - µη y 2 -(v y t ) β ∥y t -y * t ∥ 2 ≤ β ∇ y f (x t , y t ) 2 (v y t ) 1-β ∥y t -y * t ∥ 2 ≤ βG 2 (v y 0 ) 1-β ∥y t -y * t ∥ 2 ≤ βG 2 µ 2 (v y 0 ) 1-β ∥∇ y f (x t , y t )∥ 2 ≤ βG 4 µ 2 (v y 0 ) 1-β which is a constant. In total, Term (H) is bounded by (2βG) 1 1-β +2 G 2 2µ 1 1-β +3 (η y ) 1 1-β +1 (v y 0 ) 2-2β . Bringing it back, we get the desired result. Lemma C.5. 
Under the same setting as Theorem 3.2, for any constant C, we have E T -1 t=0 (f (x t , y * t ) -f (x t , y t )) ≤ 2κ 2 (η x ) 2 µ (η y ) 2 C 2α-2β E T -1 t=0 ∥∇ x f (x t , y t )∥ 2 + η y 2(1 -β) E (v y T ) 1-β + 1 µ + η y (v y 0 ) β 4κη x G 2 η y (v y 0 ) α E (v y T ) β + κ 2 µη y C β + 2C 2β (η x ) 2 2µ (η y ) 2 E 1 + log v x T -log v x 0 (v x 0 ) 2α-1 • 1 α≥0.5 + (v x T ) 1-2α 1 -2α • α<0.5 + κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β (η x ) 2 2(1 -α)η y (v y 0 ) α-β E (v x T ) 1-α + (v y 0 ) β G 2 2µ 2 η y + (2βG) 1 1-β +2 G 2 4µ 1 1-β +3 (η y ) 1 1-β +2 (v y 0 ) 2-2β . Proof. By Lemma C.2 and Lemma C.3, we have for any constant C, E T -1 t=0 (f (x t , y * t ) -f (x t , y t )) ≤ E T -1 t=0 1 -γ t µ 2γ t ∥y t -y * t ∥ 2 - 1 γ t (2 + µγ t ) y t+1 -y * t+1 2 + E T -1 t=0 γ t 2 ∇ y f (x t , y t ) 2 + 2κ 2 (η x ) 2 µ (η y ) 2 C 2α-2β E T -1 t=0 ∥∇ x f (x t , y t )∥ 2 + κ 2 µη y C β + 2C 2β (η x ) 2 2µ (η y ) 2 E 1 + log v x T -log v x 0 (v x 0 ) 2α-1 • 1 α≥0.5 + (v x T ) 1-2α 1 -2α • α<0.5 + κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β (η x ) 2 2(1 -α)η y (v y 0 ) α-β E (v x T ) 1-α + 1 µ + η y (v y 0 ) β 4κη x G 2 η y (v y 0 ) α E (v y T ) β . The first term can be bounded by Lemma C.4. For the second term, we have E T -1 t=0 γ t 2 ∇ y f (x t , y t ) 2 = E T -1 t=0 η y 2 v y t+1 β ∇ y f (x t , y t ) 2 ≤ η y 2 E v y 0 (v y 0 ) β + T -1 t=0 1 v y t+1 β ∇ y f (x t , y t ) ≤ η y 2(1 -β) E (v y T ) 1-β , where the last inequality follows from Lemma B.1. Then the proof is completed.
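Bernoulli's inequality, used in the proof of Lemma C.4 above in the form (1 + u)^β ≤ 1 + βu for 0 < β < 1 and u ≥ 0, can be checked numerically:

```python
import numpy as np

# Bernoulli's inequality: (1 + u)**beta <= 1 + beta*u for 0 < beta < 1, u >= 0
u = np.linspace(0.0, 50.0, 200001)
bernoulli_ok = all(np.all((1.0 + u)**beta <= 1.0 + beta*u + 1e-12)
                   for beta in (0.1, 0.4, 0.9))
```

It is this concavity bound that lets the increment (v^y_{t+1})^β − (v^y_t)^β be controlled by the most recent squared gradient, so the problematic term can be positive only a constant number of times.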

C.3 PROOF OF THEOREM 3.2

We present a formal version of Theorem 3.2. Theorem C.2 (stochastic setting). Under Assumptions 3.1 to 3.6, Algorithm 1 with stochastic gradient oracles satisfies that for any 0 < β < α < 1, after T iterations, E 1 T T -1 t=0 ∥∇ x f (x t , y t )∥ 2 ≤ 4∆ΦG 2α η x T 1-α + 4lκη x 1 -α + κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β 2lκ (η x ) 2 (1 -α)η y (v y 0 ) α-β G 2(1-α) T α + 2lκη y G 2(1-β) (1 -β)T β + 1 µ + η y (v y 0 ) β 16lκ 2 η x G 2(1+β) η y (v y 0 ) α T 1-β + 2κ 4 µη y C β + 2C 2β (η x ) 2 (η y ) 2 1 + log(G 2 T ) -log v x 0 (v x 0 ) 2α-1 T • 1 α≥0.5 + G 2(1-2α) (1 -2α)T 2α • 1 α<0.5 + 2κ 2 (v y 0 ) β G 2 µη y T + lκ (2βG) 1 1-β +2 G 2 µ 1 1-β +3 (η y ) 1 1-β +2 (v y 0 ) 2-2β T , and E 1 T T -1 t=0 ∥∇ y f (x t , y t )∥ 2 ≤ 4κ 3 (η x ) 2 (η y ) 2 C 2α-2β E 1 T T -1 t=0 ∥∇ x f (x t , y t )∥ 2 + lη y G 2-2β (1 -β)T β + 1 µ + η y (v y 0 ) β 8lκη x G 2+2β η y (v y 0 ) α T 1-β + κ 3 µη y C β + 2C 2β (η x ) 2 (η y ) 2 1 + log T G 2 -log v x 0 (v x 0 ) 2α-1 T • 1 α≥0.5 + G 2-4α (1 -2α)T 2α • 1 α<0.5 + κ 2 + L 2 G 2 (η x ) 2 µη y (v y 0 ) 2α-β l (η x ) 2 G 2-2α (1 -α)η y (v y 0 ) α-β T α + κ (v y 0 ) β G 2 µη y T + 2l (2βG) 1 1-β +2 G 2 4µ 1 1-β +3 (η y ) 1 1-β +2 (v y 0 ) 2-2β T . Proof. By smoothness of the primal function, we have Φ(x t+1 ) -Φ(x t ) ≤ -η t ∇Φ(x t ), ∇ x f (x t , y t ) + lκη 2 t ∇ x f (x t , y t ) 2 . By multiplying 1 ηt on both sides and taking the expectation w.r.t. the noise of current iteration, we have . E Φ(x t+1 ) -Φ(x t ) η t ≤ -⟨∇Φ(x t ), ∇ x f (x t , (12) Term (I) 2E T -1 t=0 Φ(x t ) -Φ(x t+1 ) η t ≤ 2E Φ(x 0 ) η 0 - Φ(x T ) η T -1 + T -1 t=1 Φ(x t ) 1 η t - 1 η t-1 ≤ 2E Φ max η 0 - Φ * η T -1 + T -1 t=1 Φ max 1 η t - 1 η t-1 = 2E ∆Φ η T -1 = 2E ∆Φ η x max {v x T , v y T } α . 
Term (J):
$$
2l\kappa\sum_{t=0}^{T-1}\mathbb{E}\big[\eta_t\|\nabla_x f(x_t,y_t)\|^2\big]
=2l\kappa\,\mathbb{E}\left[\sum_{t=0}^{T-1}\frac{\eta^x}{\max\{v_{t+1}^x,v_{t+1}^y\}^\alpha}\|\nabla_x f(x_t,y_t)\|^2\right]
\le 2l\kappa\eta^x\,\mathbb{E}\left[\sum_{t=0}^{T-1}\frac{\|\nabla_x f(x_t,y_t)\|^2}{(v_{t+1}^x)^\alpha}\right]
\le 2l\kappa\eta^x\,\mathbb{E}\left[\frac{v_0^x}{(v_0^x)^\alpha}+\sum_{t=0}^{T-1}\frac{\|\nabla_x f(x_t,y_t)\|^2}{(v_{t+1}^x)^\alpha}\right]
\le\frac{2l\kappa\eta^x}{1-\alpha}\,\mathbb{E}\big[(v_T^x)^{1-\alpha}\big].
$$
Term (K): according to the smoothness of $f(x_t,\cdot)$, we have
$$
\mathbb{E}\left[\sum_{t=0}^{T-1}\|\nabla_x f(x_t,y_t)-\nabla\Phi(x_t)\|^2\right]
\le l^2\,\mathbb{E}\left[\sum_{t=0}^{T-1}\|y_t-y_t^*\|^2\right]
\le 2l\kappa\,\mathbb{E}\left[\sum_{t=0}^{T-1}\big(f(x_t,y_t^*)-f(x_t,y_t)\big)\right],
$$
where the last inequality follows from strong concavity in $y$. Now we let $C=\left(\frac{8l\kappa^3(\eta^x)^2}{\mu(\eta^y)^2}\right)^{\frac{1}{2\alpha-2\beta}}$ and apply Lemma C.5. In total, we have
$$
\begin{aligned}
\mathbb{E}\left[\sum_{t=0}^{T-1}\|\nabla_x f(x_t,y_t)\|^2\right]
\le{}& \frac{1}{2}\,\mathbb{E}\left[\sum_{t=0}^{T-1}\|\nabla_x f(x_t,y_t)\|^2\right]
+2\,\mathbb{E}\left[\frac{\Delta\Phi\max\{v_T^x,v_T^y\}^\alpha}{\eta^x}\right]
+\frac{2l\kappa\eta^x}{1-\alpha}\,\mathbb{E}\big[(v_T^x)^{1-\alpha}\big]
+\frac{l\kappa\eta^y}{1-\beta}\,\mathbb{E}\big[(v_T^y)^{1-\beta}\big] \\
&+\left(\frac{1}{\mu}+\frac{\eta^y}{(v_0^y)^\beta}\right)\frac{8l\kappa^2\eta^x G^2}{\eta^y(v_0^y)^\alpha}\,\mathbb{E}\big[(v_T^y)^\beta\big]
+\frac{\kappa^4\big(C^\beta+2C^{2\beta}\big)(\eta^x)^2}{\mu(\eta^y)^3}\,\mathbb{E}\left[\frac{1+\log v_T^x-\log v_0^x}{(v_0^x)^{2\alpha-1}}\cdot 1_{\alpha\ge 0.5}+\frac{(v_T^x)^{1-2\alpha}}{1-2\alpha}\cdot 1_{\alpha<0.5}\right] \\
&+\left(\kappa^2+\frac{L^2G^2(\eta^x)^2}{\mu\eta^y(v_0^y)^{2\alpha-\beta}}\right)\frac{l\kappa(\eta^x)^2}{(1-\alpha)\eta^y(v_0^y)^{\alpha-\beta}}\,\mathbb{E}\big[(v_T^x)^{1-\alpha}\big]
+\frac{\kappa^2(v_0^y)^\beta G^2}{\mu\eta^y}
+\frac{l\kappa(2\beta G)^{\frac{1}{1-\beta}+2}G^2}{2\mu^{\frac{1}{1-\beta}+3}(\eta^y)^{\frac{1}{1-\beta}+2}(v_0^y)^{2-2\beta}}.
\end{aligned}
$$
It remains to bound $(v_T^z)^m$ for $z\in\{x,y\}$ and $m\ge 0$: by the bounded-gradient assumption, $(v_T^z)^m\le(TG^2)^m$. Bringing this back, we conclude the proof.
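For intuition about the updates analyzed in this proof, the following is a minimal sketch of the TiAda iteration with the max coupling in the primal effective stepsize, $\eta_t = \eta^x/\max\{v_{t+1}^x, v_{t+1}^y\}^\alpha$. The quadratic objective below is a toy strongly-convex-strongly-concave example chosen only for illustration; it is not one of the paper's test problems, and the scalar form is an assumption made to keep the sketch short.

```python
def tiada(grad_x, grad_y, x, y, T=2000, eta_x=0.1, eta_y=0.1,
          alpha=0.5, beta=0.4, v0=1.0):
    """Sketch of TiAda (scalar case): the x-player's stepsize is scaled by
    max{v_x, v_y}^alpha, which enforces the time-scale separation on the fly."""
    vx, vy = v0, v0
    for _ in range(T):
        gx, gy = grad_x(x, y), grad_y(x, y)
        vx += gx**2                              # cumulative squared x-gradients
        vy += gy**2                              # cumulative squared y-gradients
        x -= eta_x / max(vx, vy)**alpha * gx     # descent with the max coupling
        y += eta_y / vy**beta * gy               # adaptive ascent on y
    return x, y

# Toy objective (illustration only): f(x, y) = x^2/2 + 2xy - y^2/2,
# whose unique stationary point is (0, 0).
grad_x = lambda x, y: x + 2*y
grad_y = lambda x, y: 2*x - y
x, y = tiada(grad_x, grad_y, 1.0, 1.0)
```

On this toy problem the gradient norm at the returned iterate ends up well below its initial value, even though no stepsizes were tuned to the problem's constants.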

C.4 TIADA WITHOUT ACCESSING OPPONENT'S GRADIENTS

The effective stepsize of $x$ requires knowledge of the gradients of $y$, i.e., $v_{t+1}^y$. At the end of Section 3, we discussed the situation where such information is not available. We now formally introduce the algorithm and present its convergence result.

Algorithm 2 TiAda without max
1: Input: $(x_0,y_0)$, $v_0^x>0$, $v_0^y>0$, $\eta^x>0$, $\eta^y>0$, $\alpha>0$, $\beta>0$ and $\alpha>\beta$.
2: for $t=0,1,2,\dots$ do
3: &nbsp;&nbsp; sample i.i.d. $\xi_t^x$ and $\xi_t^y$, and let $g_t^x=\nabla_x F(x_t,y_t;\xi_t^x)$ and $g_t^y=\nabla_y F(x_t,y_t;\xi_t^y)$
4: &nbsp;&nbsp; $v_{t+1}^x=v_t^x+\|g_t^x\|^2$ and $v_{t+1}^y=v_t^y+\|g_t^y\|^2$
5: &nbsp;&nbsp; $x_{t+1}=x_t-\frac{\eta^x}{(v_{t+1}^x)^\alpha}g_t^x$ and $y_{t+1}=\mathcal{P}_Y\left(y_t+\frac{\eta^y}{(v_{t+1}^y)^\beta}g_t^y\right)$
6: end for

Theorem C.3 (stochastic). Under Assumptions 3.1, 3.2, 3.4 and 3.5, Algorithm 2 with stochastic gradient oracles satisfies, for any $0<\beta<\alpha<1$, after $T$ iterations,
$$
\begin{aligned}
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla_x f(x_t,y_t)\|^2\right]
\le{}& \frac{2\Delta\Phi G^{2\alpha}}{\eta^x T^{1-\alpha}}
+\frac{2l\kappa\eta^x G^{2-2\alpha}}{(1-\alpha)T^\alpha}
+\left(\frac{(v_0^y)^\beta G^2}{2\mu^2\eta^y}+\frac{(2\beta G)^{\frac{1}{1-\beta}+2}G^2}{4\mu^{\frac{1}{1-\beta}+3}(\eta^y)^{\frac{1}{1-\beta}+2}(v_0^y)^{2-2\beta}}\right)\frac{1}{T}
+\frac{\eta^y G^{2-2\beta}}{2(1-\beta)T^\beta} \\
&+\left(\frac{(\eta^x)^2\kappa^2}{2(v_0^y)^\beta\eta^y}+\frac{(\eta^x)^2\kappa^2}{\mu(\eta^y)^2}\right)\left(\frac{\big(1+\log(G^2T)-\log v_0^x\big)G^{4\beta}}{(v_0^x)^{2\alpha-1}T^{1-2\beta}}\cdot 1_{\alpha\ge 0.5}+\frac{G^{2-4\alpha+4\beta}}{(1-2\alpha)T^{2\alpha-2\beta}}\cdot 1_{\alpha<0.5}\right),
\end{aligned}
$$
and
$$
\begin{aligned}
\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla_y f(x_t,y_t)\|^2\right]
\le{}& \left(\frac{\kappa(v_0^y)^\beta G^2}{\mu\eta^y}+\frac{2l(2\beta G)^{\frac{1}{1-\beta}+2}G^2}{4\mu^{\frac{1}{1-\beta}+3}(\eta^y)^{\frac{1}{1-\beta}+2}(v_0^y)^{2-2\beta}}\right)\frac{1}{T}
+\frac{l\eta^y G^{2-2\beta}}{(1-\beta)T^\beta} \\
&+\left(\frac{l(\eta^x)^2\kappa^2}{(v_0^y)^\beta\eta^y}+\frac{2(\eta^x)^2\kappa^3}{(\eta^y)^2}\right)\left(\frac{\big(1+\log(G^2T)-\log v_0^x\big)G^{4\beta}}{(v_0^x)^{2\alpha-1}T^{1-2\beta}}\cdot 1_{\alpha\ge 0.5}+\frac{G^{2-4\alpha+4\beta}}{(1-2\alpha)T^{2\alpha-2\beta}}\cdot 1_{\alpha<0.5}\right).
\end{aligned}
$$
Remark C.1. The best achievable rate is $O(\epsilon^{-6})$, obtained by choosing $\alpha=1/2$ and $\beta=1/3$.

Proof. Lemmas C.1 and C.4 can be used directly here because they do not involve or expand the effective stepsize of $x$, i.e., $\eta_t$. The same holds for the beginning of Appendix C.3, the proof of Theorem 3.2, up to Equation (12). However, we need to bound Terms (I), (J) and (K) in Equation (12) differently. By our assumption of bounded stochastic gradients, $v_T^x$ and $v_T^y$ are both upper bounded by $TG^2$, which we will use throughout the proof.
Term (I):
$$
2\,\mathbb{E}\left[\sum_{t=0}^{T-1}\frac{\Phi(x_t)-\Phi(x_{t+1})}{\eta_t}\right]
\le 2\,\mathbb{E}\left[\frac{\Phi(x_0)}{\eta_0}-\frac{\Phi(x_T)}{\eta_{T-1}}+\sum_{t=1}^{T-1}\Phi(x_t)\left(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\right)\right]
\le 2\,\mathbb{E}\left[\frac{\Phi_{\max}}{\eta_0}-\frac{\Phi^*}{\eta_{T-1}}+\sum_{t=1}^{T-1}\Phi_{\max}\left(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\right)\right]
=2\,\mathbb{E}\left[\frac{\Delta\Phi}{\eta_{T-1}}\right]
=2\,\mathbb{E}\left[\frac{\Delta\Phi(v_T^x)^\alpha}{\eta^x}\right]
\le\frac{2\Delta\Phi G^{2\alpha}T^\alpha}{\eta^x}.
$$
Term (J):
$$
2l\kappa\sum_{t=0}^{T-1}\mathbb{E}\big[\eta_t\|\nabla_x f(x_t,y_t)\|^2\big]
=2l\kappa\eta^x\,\mathbb{E}\left[\sum_{t=0}^{T-1}\frac{\|\nabla_x f(x_t,y_t)\|^2}{(v_{t+1}^x)^\alpha}\right]
\le 2l\kappa\eta^x\,\mathbb{E}\left[\frac{v_0^x}{(v_0^x)^\alpha}+\sum_{t=0}^{T-1}\frac{\|\nabla_x f(x_t,y_t)\|^2}{(v_{t+1}^x)^\alpha}\right]
\le\frac{2l\kappa\eta^x}{1-\alpha}\,\mathbb{E}\big[(v_T^x)^{1-\alpha}\big]
\le\frac{2l\kappa\eta^x G^{2-2\alpha}T^{1-\alpha}}{1-\alpha}.
$$
Term (K): according to the smoothness and strong concavity of $f(x_t,\cdot)$, we have
$$
\mathbb{E}\left[\sum_{t=0}^{T-1}\|\nabla_x f(x_t,y_t)-\nabla\Phi(x_t)\|^2\right]
\le l^2\,\mathbb{E}\left[\sum_{t=0}^{T-1}\|y_t-y_t^*\|^2\right]
\le 2l\kappa\,\mathbb{E}\left[\sum_{t=0}^{T-1}\big(f(x_t,y_t^*)-f(x_t,y_t)\big)\right],
$$
which, by Lemma C.4, decomposes into a term of order $O(1)$ and the two terms (L) and (M) bounded as follows.

Term (L):
$$
\text{Term (L)}
\le\mathbb{E}\left[\frac{\eta^y}{2}\left(\frac{v_0^y}{(v_0^y)^\beta}+\sum_{t=0}^{T-1}\frac{\|\nabla_y f(x_t,y_t)\|^2}{(v_{t+1}^y)^\beta}\right)\right]
\le\frac{\eta^y}{2(1-\beta)}\,\mathbb{E}\big[(v_T^y)^{1-\beta}\big]
\le\frac{\eta^y G^{2-2\beta}T^{1-\beta}}{2(1-\beta)}.
$$
Term (M): with $\lambda_t=\frac{2(v_{t+1}^y)^\beta}{\mu\eta^y}$ and using $(v_{t+1}^y)^\beta\le(v_{t+1}^y)^{2\beta}/(v_0^y)^\beta$,
$$
\begin{aligned}
\text{Term (M)}
&=\mathbb{E}\left[\sum_{t=0}^{T-1}\frac{(v_{t+1}^y)^\beta}{2\eta^y}(1+\lambda_t)\|y_{t+1}^*-y_t^*\|^2\right]
\le\left(\frac{1}{2(v_0^y)^\beta\eta^y}+\frac{1}{\mu(\eta^y)^2}\right)\mathbb{E}\left[\sum_{t=0}^{T-1}(v_{t+1}^y)^{2\beta}\|y_{t+1}^*-y_t^*\|^2\right] \\
&\le\left(\frac{1}{2(v_0^y)^\beta\eta^y}+\frac{1}{\mu(\eta^y)^2}\right)\mathbb{E}\left[(v_T^y)^{2\beta}\sum_{t=0}^{T-1}\|y_{t+1}^*-y_t^*\|^2\right]
\le\left(\frac{\kappa^2}{2(v_0^y)^\beta\eta^y}+\frac{\kappa^2}{\mu(\eta^y)^2}\right)\mathbb{E}\left[(v_T^y)^{2\beta}\sum_{t=0}^{T-1}\|x_{t+1}-x_t\|^2\right] \\
&=\left(\frac{(\eta^x)^2\kappa^2}{2(v_0^y)^\beta\eta^y}+\frac{(\eta^x)^2\kappa^2}{\mu(\eta^y)^2}\right)\mathbb{E}\left[(v_T^y)^{2\beta}\sum_{t=0}^{T-1}\frac{\|\nabla_x f(x_t,y_t)\|^2}{(v_{t+1}^x)^{2\alpha}}\right] \\
&\le\left(\frac{(\eta^x)^2\kappa^2}{2(v_0^y)^\beta\eta^y}+\frac{(\eta^x)^2\kappa^2}{\mu(\eta^y)^2}\right)\mathbb{E}\left[(v_T^y)^{2\beta}\left(\frac{1+\log v_T^x-\log v_0^x}{(v_0^x)^{2\alpha-1}}\cdot 1_{\alpha\ge 0.5}+\frac{(v_T^x)^{1-2\alpha}}{1-2\alpha}\cdot 1_{\alpha<0.5}\right)\right] \\
&\le\left(\frac{(\eta^x)^2\kappa^2}{2(v_0^y)^\beta\eta^y}+\frac{(\eta^x)^2\kappa^2}{\mu(\eta^y)^2}\right)\left(\frac{\big(1+\log(G^2T)-\log v_0^x\big)G^{4\beta}T^{2\beta}}{(v_0^x)^{2\alpha-1}}\cdot 1_{\alpha\ge 0.5}+\frac{G^{2-4\alpha+4\beta}T^{1-2\alpha+2\beta}}{1-2\alpha}\cdot 1_{\alpha<0.5}\right),
\end{aligned}
$$
where we used the $\kappa$-Lipschitzness of $y^*(\cdot)$ in the third inequality. Summarizing all the terms, we finish the proof.
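Algorithm 2 can be sketched analogously to TiAda; the only change is that the primal stepsize is scaled by $(v_{t+1}^x)^\alpha$ alone, so neither player reads the opponent's gradient statistics. The exponents $\alpha=1/2$, $\beta=1/3$ follow Remark C.1; the scalar quadratic objective and the interval constraint $Y=[-2,2]$ are illustrative assumptions, not taken from the paper's experiments.

```python
def tiada_no_max(grad_x, grad_y, x, y, T=2000, eta_x=0.1, eta_y=0.1,
                 alpha=0.5, beta=1/3, v0=1.0, proj_y=lambda y: y):
    """Sketch of Algorithm 2 (scalar case): TiAda without the max coupling;
    each player scales its stepsize by its own cumulative gradient norms."""
    vx, vy = v0, v0
    for _ in range(T):
        gx, gy = grad_x(x, y), grad_y(x, y)
        vx += gx**2
        vy += gy**2
        x = x - eta_x / vx**alpha * gx          # primal descent, own statistics only
        y = proj_y(y + eta_y / vy**beta * gy)   # projected adaptive dual ascent
    return x, y

# Illustrative toy problem: f(x, y) = x^2/2 + 2xy - y^2/2 with Y = [-2, 2].
grad_x = lambda x, y: x + 2*y
grad_y = lambda x, y: 2*x - y
clip = lambda y: min(max(y, -2.0), 2.0)   # projection onto Y = [-2, 2]
x, y = tiada_no_max(grad_x, grad_y, 1.0, 1.0, proj_y=clip)
```

Dropping the max removes the coupling that Theorem C.2 exploits, which is reflected in the weaker $O(\epsilon^{-6})$ rate of Theorem C.3.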



Please refer to Section for formal definitions of the initial stepsize and the effective stepsize. Note that the initial stepsize ratio, $\eta^x/\eta^y$, does not necessarily equal the first effective stepsize ratio, $\eta_0^x/\eta_0^y$.



Figure 1: Comparison between TiAda and vanilla GDA with AdaGrad stepsizes (labeled as AdaGrad) on the quadratic function (2) with L = 2 under a poor initial stepsize ratio, i.e., η^x/η^y = 5. Here, η^x_t and η^y_t are the effective stepsizes for x and y, respectively, and κ is the condition number. (a) shows the trajectories of the two algorithms, with the background color indicating the function value f(x, y). In (b), while the effective stepsize ratio stays unchanged for AdaGrad, TiAda adapts to the desired time-scale separation 1/κ, which divides the training process into two stages. In (c), after entering Stage II, TiAda converges fast, whereas AdaGrad diverges.

Figure 2: Comparison of algorithms on test functions. r = η^x/η^y is the initial stepsize ratio. In the first row, we use the quadratic function (2) with L = 2 under deterministic gradient oracles. In the second row, we test the methods on the McCormick function with noisy gradients.

Figure 3: Comparison of the algorithms on distributional robustness optimization (5). We use i in the legend to indicate the number of inner loops. Here we present two sets of stepsize configurations for the comparisons of AdaGrad-like and Adam-like algorithms. Please refer to Appendix A.3 for extensive experiments over larger ranges of stepsizes, which show that TiAda performs best among all stepsize combinations in our grid.

Figure 4: Inception score on WGAN-GP.

and the first row of Figure 2, we conduct experiments on problem (2) with L = 2. We use initial stepsize η^y = 0.2 and initial point (1, 0.01) for all runs. For the McCormick function used in the second row of Figure 2, we chose η^y = 0.01, and the noise added to the gradients is zero-mean Gaussian with variance 0.01.

Distributional Robustness Optimization. For the results shown in Figures 3, 6 and 7, we adapt code from Lv (2019) and use the same hyper-parameter setting as Sinha et al. (2018); Sebbouh et al. (2022), i.e., γ = 1.3. The model is a three-layer convolutional neural network (CNN) with a final fully-connected layer. Each layer uses batch normalization and ELU activation. The layer widths are (32, 64, 128, 512). The setting is the same as Sinha et al. (2018); Yang et al.

Figure 5: Illustration of the effect of α and β on the two stages of TiAda's time-scale adaptation process. We set β = 1 − α. The dashed line in the right plot marks the first iteration at which the effective stepsize ratio falls below 1/κ.

Figure 6: Gradient norms in x of AdaGrad-like algorithms on distributional robustness optimization (5). We use i in the legend to indicate the number of inner loops.

Figure 7: Gradient norms in x of Adam-like algorithms on distributional robustness optimization (5). We use i in the legend to indicate the number of inner loops.

Junchi Yang, Negar Kiyavash, and Niao He. Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 33:1153-1165, 2020.

Junchi Yang, Xiang Li, and Niao He. Nest your adaptive algorithm for parameter-agnostic nonconvex minimax optimization. NeurIPS, 2022a.

Junchi Yang, Antonio Orvieto, Aurelien Lucchi, and Niao He. Faster single-loop algorithms for minimax optimization without strong concavity. In AISTATS, pp. 5485-5517. PMLR, 2022b.

Siqi Zhang, Junchi Yang, Cristóbal Guzmán, Negar Kiyavash, and Niao He. The complexity of nonconvex-strongly-concave minimax optimization. In Uncertainty in Artificial Intelligence, pp. 482-492. PMLR, 2021.

Xuan Zhang, Necdet Serhat Aybat, and Mert Gurbuzbalaban. SAPD+: An accelerated stochastic method for nonconvex-concave minimax problems. arXiv preprint arXiv:2205.15084, 2022a.

Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, and Zhi-Quan Luo. Adam can converge without any modification on update rules. arXiv preprint arXiv:2208.09632, 2022b.

Dongruo Zhou, Jinghui Chen, Yuan Cao, Yiqi Tang, Ziyan Yang, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.

Stepsize schemes that fit in generalized TiAda. See also Yang et al. (2022a).


ACKNOWLEDGMENT

We thank Dr. Anas Barakat and the anonymous reviewers for their valuable feedback. This work is supported by an ETH research grant and Swiss National Science Foundation (SNSF) Project Funding No. 200021-207343.

