TIADA: A TIME-SCALE ADAPTIVE ALGORITHM FOR NONCONVEX MINIMAX OPTIMIZATION

Abstract

Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and the knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications.

1. INTRODUCTION

Adaptive gradient methods, such as AdaGrad (Duchi et al., 2011), Adam (Kingma & Ba, 2015), and AMSGrad (Reddi et al., 2018), have become the default choice of optimization algorithms in many machine learning applications owing to their robustness to hyper-parameter selection and fast empirical convergence. These advantages are especially prominent in the nonconvex regime, with notable success in training deep neural networks (DNNs). Classic analyses of gradient descent for smooth functions require the stepsize to be less than $2/\ell$, where $\ell$ is the smoothness parameter, which is often unknown for complicated models like DNNs. Many adaptive schemes, usually with diminishing stepsizes based on cumulative gradient information, can adapt to such parameters and thus reduce the burden of hyper-parameter tuning (Ward et al., 2020; Xie et al., 2020). Such tuning-free algorithms are called parameter-agnostic, as they do not require any prior knowledge of problem-specific parameters, e.g., the smoothness or strong-convexity parameter. In this work, we aim to bring the benefits of adaptive stepsizes to solving the following problem:

$$\min_{x \in \mathbb{R}^{d_1}} \max_{y \in \mathcal{Y}} f(x, y) = \mathbb{E}_{\xi \sim P}\left[F(x, y; \xi)\right],$$

where $P$ is an unknown distribution from which we can draw i.i.d. samples, $\mathcal{Y} \subset \mathbb{R}^{d_2}$ is closed and convex, and $f: \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}$ is nonconvex in $x$. We call $x$ the primal variable and $y$ the dual variable. This minimax formulation has found vast applications in modern machine learning, notably generative adversarial networks (Goodfellow et al., 2014; Arjovsky et al., 2017), adversarial learning (Goodfellow et al., 2015; Miller et al., 2020), reinforcement learning (Dai et al., 2017; Modi et al., 2021), sharpness-aware minimization (Foret et al., 2021), domain-adversarial training (Ganin et al., 2016), etc.
Albeit theoretically underexplored, adaptive methods are widely deployed in these applications in combination with popular minimax optimization algorithms such as (stochastic) gradient descent ascent (GDA), extragradient (EG) (Korpelevich, 1976), and optimistic GDA (Popov, 1980; Rakhlin & Sridharan, 2013); see, e.g., (Gulrajani et al., 2017; Daskalakis et al., 2018; Mishchenko et al., 2020; Reisizadeh et al., 2020), just to list a few. While it seems natural to directly extend adaptive stepsizes to minimax optimization algorithms, a recent work by Yang et al. (2022a) pointed out that such schemes may not always converge without

