TIADA: A TIME-SCALE ADAPTIVE ALGORITHM FOR NONCONVEX MINIMAX OPTIMIZATION

Abstract

Adaptive gradient methods have shown their ability to adjust stepsizes on the fly in a parameter-agnostic manner and empirically achieve faster convergence when solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of a delicate time-scale separation between the primal and dual updates to attain convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and achieves near-optimal complexities simultaneously in the deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically on a number of machine learning applications.

1. INTRODUCTION

Adaptive gradient methods, such as AdaGrad (Duchi et al., 2011), Adam (Kingma & Ba, 2015) and AMSGrad (Reddi et al., 2018), have become the default choice of optimization algorithms in many machine learning applications owing to their robustness to hyper-parameter selection and fast empirical convergence. These advantages are especially prominent in the nonconvex regime, with success in training deep neural networks (DNNs). Classic analyses of gradient descent for smooth functions require the stepsize to be less than $2/l$, where $l$ is the smoothness parameter, which is often unknown for complicated models like DNNs. Many adaptive schemes, usually with diminishing stepsizes based on cumulative gradient information, can adapt to such parameters and thus reduce the burden of hyper-parameter tuning (Ward et al., 2020; Xie et al., 2020). Such tuning-free algorithms are called parameter-agnostic, as they do not require any prior knowledge of problem-specific parameters, e.g., the smoothness or strong-convexity parameter. In this work, we aim to bring the benefits of adaptive stepsizes to solving the following problem:

$$\min_{x \in \mathbb{R}^{d_1}} \max_{y \in \mathcal{Y}} f(x, y) = \mathbb{E}_{\xi \sim P}\left[F(x, y; \xi)\right], \qquad (1)$$

where $P$ is an unknown distribution from which we can draw i.i.d. samples, $\mathcal{Y} \subset \mathbb{R}^{d_2}$ is closed and convex, and $f: \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}$ is nonconvex in $x$. We call $x$ the primal variable and $y$ the dual variable. This minimax formulation has found vast applications in modern machine learning, notably generative adversarial networks (Goodfellow et al., 2014; Arjovsky et al., 2017), adversarial learning (Goodfellow et al., 2015; Miller et al., 2020), reinforcement learning (Dai et al., 2017; Modi et al., 2021), sharpness-aware minimization (Foret et al., 2021), domain-adversarial training (Ganin et al., 2016), etc.
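As a point of reference, a single update of stochastic GDA for problem (1) can be sketched in a few lines. The toy objective, the Euclidean ball standing in for $\mathcal{Y}$, and all stepsize values below are illustrative placeholders, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_ball(y, radius=1.0):
    """Euclidean projection onto the closed convex set {y : ||y|| <= radius}."""
    n = np.linalg.norm(y)
    return y if n <= radius else y * (radius / n)

def stoch_grad(x, y):
    """Unbiased stochastic gradients of a toy F(x, y; xi) whose expectation
    f(x, y) = -||y||^2 / 2 + <x, y> is strongly concave in y."""
    noise = rng.normal(scale=0.1, size=(2, x.size))
    gx = y + noise[0]        # df/dx = y
    gy = -y + x + noise[1]   # df/dy = -y + x
    return gx, gy

def gda_step(x, y, eta_x=0.05, eta_y=0.1):
    """One stochastic GDA step: descend on x, ascend on y, project y onto Y."""
    gx, gy = stoch_grad(x, y)
    return x - eta_x * gx, project_ball(y + eta_y * gy)
```

A full GDA run simply iterates `gda_step`; the projection keeps $y$ feasible regardless of the gradient noise.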
Albeit theoretically underexplored, adaptive methods are widely deployed in these applications in combination with popular minimax optimization algorithms such as (stochastic) gradient descent ascent (GDA), extragradient (EG) (Korpelevich, 1976), and optimistic GDA (Popov, 1980; Rakhlin & Sridharan, 2013); see, e.g., (Gulrajani et al., 2017; Daskalakis et al., 2018; Mishchenko et al., 2020; Reisizadeh et al., 2020), just to list a few. While it seems natural to directly extend adaptive stepsizes to minimax optimization algorithms, a recent work by Yang et al. (2022a) pointed out that such schemes may not always converge without knowing problem-dependent parameters. Unlike the case of minimization, convergence analyses of GDA and EG for nonconvex minimax optimization are subject to time-scale separation (Boţ & Böhm, 2020; Lin et al., 2020a; Sebbouh et al., 2022; Yang et al., 2022b): the stepsize ratio of the primal and dual variables needs to be smaller than a problem-dependent threshold, which was recently shown to be necessary even when the objective is strongly concave in $y$ and true gradients are available (Li et al., 2022). Moreover, Yang et al. (2022a) showed that GDA with standard adaptive stepsizes, which choose the stepsize of each variable based only on the (moving) average of its own past gradients, fails to adapt to the time-scale separation requirement. Take the following nonconvex-strongly-concave function as a concrete example:

$$f(x, y) = -\frac{1}{2}y^2 + Lxy - \frac{L^2}{2}x^2, \qquad (2)$$

where $L > 0$ is a constant. Yang et al. (2022a) proved that directly using adaptive stepsizes like AdaGrad, Adam and AMSGrad will fail to converge if the ratio of the initial stepsizes of $x$ and $y$ (denoted as $\eta_x$ and $\eta_y$) is large. We illustrate this phenomenon in Figures 1(a) and 1(c), where AdaGrad diverges. To sum up, adaptive stepsizes designed for minimization are not time-scale adaptive for minimax optimization, and thus not parameter-agnostic.
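The divergence phenomenon on the test function above can be reproduced in a few lines: the sketch below runs GDA with per-variable AdaGrad-norm stepsizes on function (2) with $L = 2$ and initial stepsize ratio $\eta_x/\eta_y = 5$. The concrete stepsize values, initial point, and iteration budget are our own illustrative choices, not those of Yang et al. (2022a).

```python
import math

L = 2.0

def grads(x, y):
    # df/dx = L*y - L^2*x,  df/dy = -y + L*x
    return L * y - L ** 2 * x, -y + L * x

def adagrad_gda(eta_x=0.5, eta_y=0.1, steps=20000):
    x, y = 1.0, 0.0           # start away from the stationary set y = L*x
    vx = vy = 0.0
    norms = []
    for _ in range(steps):
        gx, gy = grads(x, y)
        vx += gx * gx
        vy += gy * gy
        norms.append(math.hypot(gx, gy))
        x -= eta_x / math.sqrt(vx) * gx   # AdaGrad-norm stepsize for x
        y += eta_y / math.sqrt(vy) * gy   # AdaGrad-norm stepsize for y
    return norms

norms = adagrad_gda()
# the gradient norm keeps growing: the iterates drift away from the
# stationary set y = L*x instead of converging
```

On this function the drift can be seen analytically: both gradients are proportional to the deviation $d = y - Lx$, and each step multiplies $d$ by a factor strictly greater than one whenever the stepsize ratio is too large, so the gradient norm increases monotonically.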
To circumvent this time-scale separation bottleneck, Yang et al. (2022a) introduced an adaptive algorithm, NeAda, for problem (1) with nonconvex-strongly-concave objectives. NeAda is a two-loop algorithm built upon GDmax (Lin et al., 2020a): after each primal variable update, it updates the dual variable for multiple steps until a stopping criterion is satisfied in the inner loop. Although the algorithm is agnostic to the smoothness and strong-concavity parameters, several limitations may undermine its performance in large-scale training: (a) in the stochastic setting, it gradually increases the number of inner-loop steps ($k$ steps for the $k$-th outer loop) to improve the accuracy of the inner maximization problem, resulting in possibly wasted inner-loop updates if the maximization problem is already well solved; (b) NeAda needs a large batch size of order $\Omega(\epsilon^{-2})$ to achieve the near-optimal convergence rate in theory; (c) it is not fully adaptive to the gradient noise, since it deploys different strategies for the deterministic and stochastic settings. In this work, we address all of the issues above by proposing TiAda (Time-scale Adaptive Algorithm), a single-loop algorithm with time-scale adaptivity for minimax optimization. Specifically, one of our major modifications is to set the effective stepsize of the primal variable, i.e., the scale of the (stochastic) gradient used in the updates, to the reciprocal of the maximum between the primal and dual variables' second moments, i.e., the cumulative sums of their past squared gradient norms. This ensures that the effective stepsize ratio of $x$ and $y$ is upper bounded by a decreasing sequence, which eventually reaches the desired time-scale separation. Taking the test function (2) as an example, Figure 1 illustrates the time-scale adaptivity of TiAda: in Stage I, the stepsize ratio quickly decreases below the threshold; in Stage II, the ratio is stabilized and the gradient norm starts to converge fast.
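A minimal sketch of this max-based stepsize rule on the same test function (2) follows. The decay exponents ($\alpha = 0.6$ for the primal variable, $\beta = 0.4$ for the dual) and all other constants are assumed illustrative values, not necessarily the paper's exact settings.

```python
import math

L = 2.0

def grads(x, y):
    return L * y - L ** 2 * x, -y + L * x   # (df/dx, df/dy) of function (2)

def tiada(eta_x=0.5, eta_y=0.1, alpha=0.6, beta=0.4, steps=20000):
    x, y = 1.0, 0.0
    vx = vy = 0.0
    ratios, norms = [], []
    for _ in range(steps):
        gx, gy = grads(x, y)
        vx += gx * gx
        vy += gy * gy
        sx = eta_x / max(vx, vy) ** alpha   # primal: max of both second moments
        sy = eta_y / vy ** beta             # dual: its own second moment only
        ratios.append(sx / sy)
        norms.append(math.hypot(gx, gy))
        x -= sx * gx
        y += sy * gy
    return ratios, norms

ratios, norms = tiada()
# the effective stepsize ratio sx/sy shrinks over time; once it crosses the
# time-scale separation threshold 1/kappa = 1/L^2, the gradient norm decays
```

Because $\alpha > \beta$, the ratio `sx / sy` scales as a negative power of the accumulated second moment, so it decreases toward zero and eventually drops below the required threshold, mirroring the two-stage behavior in Figure 1.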
We focus on the minimax optimization (1) that is strongly concave in $y$, since other nonconvex regimes are far less understood even without adaptive stepsizes. Moreover, near stationary point

¹Please refer to Section 2 for the formal definitions of the initial stepsize and the effective stepsize. Note that the initial stepsize ratio, $\eta_x/\eta_y$, does not necessarily equal the first effective stepsize ratio, $\eta_0^x/\eta_0^y$.



Figure 1: Comparison between TiAda and vanilla GDA with AdaGrad stepsizes (labeled as AdaGrad) on the quadratic function (2) with $L = 2$ under a poor initial stepsize ratio, i.e., $\eta_x/\eta_y = 5$. Here, $\eta_t^x$ and $\eta_t^y$ are the effective stepsizes for $x$ and $y$, respectively, and $\kappa$ is the condition number¹. (a) shows the trajectories of the two algorithms, and the background color shows the function value $f(x, y)$. In (b), while the effective stepsize ratio stays unchanged for AdaGrad, TiAda adapts to the desired time-scale separation $1/\kappa$, which divides the training process into two stages. In (c), after entering Stage II, TiAda converges fast, whereas AdaGrad diverges.

