PROXIMAL GRADIENT DESCENT-ASCENT: VARIABLE CONVERGENCE UNDER KŁ GEOMETRY

Abstract

The gradient descent-ascent (GDA) algorithm has been widely applied to solve minimax optimization problems. In order to obtain convergent policy parameters for minimax optimization, it is important that GDA generates convergent variable sequences rather than merely convergent sequences of function values or gradient norms. However, the variable convergence of GDA has been proved only under convexity geometries, and such understanding is lacking for general nonconvex minimax optimization. This paper fills this gap by studying the convergence of a more general proximal-GDA for regularized nonconvex-strongly-concave minimax optimization. Specifically, we show that proximal-GDA admits a novel Lyapunov function, which monotonically decreases in the minimax optimization process and drives the variable sequence to a critical point. By leveraging this Lyapunov function and the KŁ geometry that parameterizes the local geometries of general nonconvex functions, we formally establish the variable convergence of proximal-GDA to a critical point x*, i.e., x_t → x*, y_t → y*(x*). Furthermore, over the full spectrum of the KŁ-parameterized geometry, we show that proximal-GDA achieves convergence rates ranging from sublinear convergence up to finite-step convergence, depending on the geometry associated with the KŁ parameter. This is the first theoretical result on variable convergence for nonconvex minimax optimization.

1. INTRODUCTION

Minimax optimization is a classical optimization framework that has been widely applied in various modern machine learning applications, including game theory Ferreira et al. (2012), generative adversarial networks (GANs) Goodfellow et al. (2014), adversarial training Sinha et al. (2017), reinforcement learning Qiu et al. (2020), imitation learning Ho and Ermon (2016); Song et al. (2018), etc. A typical minimax optimization problem is shown below, where f is a differentiable function:

min_{x∈R^m} max_{y∈Y} f(x, y).

A popular algorithm for solving the above minimax problem is gradient descent-ascent (GDA), which alternately performs a gradient descent update on the variable x and a gradient ascent update on the variable y in each iteration. Under this alternation between descent and ascent updates, it is much desired that GDA generate variable sequences that converge to a certain optimal point, i.e., that the minimax players obtain convergent optimal policies. In the existing literature, many studies have established the convergence of GDA-type algorithms under various global geometries of the objective function, e.g., the convex-concave geometry (f is convex in x and concave in y) Nedić and Ozdaglar (2009), the bilinear geometry Neumann (1928); Robinson (1951) and the Polyak-Łojasiewicz (PŁ) geometry Nouiehed et al. (2019); Yang et al. (2020). Some other works studied GDA under stronger global geometric conditions on f, such as the convex-strongly-concave geometry Du and Hu (2019) and the strongly-convex-strongly-concave geometry Mokhtari et al. (2020); Zhang and Wang (2020), under which GDA is shown to generate convergent variable sequences. However, these special global function geometries do not hold for modern machine learning problems, which usually have complex models and nonconvex geometry. Recently, many studies characterized the convergence of GDA in nonconvex minimax optimization, where the objective function is nonconvex in x. Specifically, Lin et al. (2020); Nouiehed et al. (2019); Xu et al. (2020b); Boţ and Böhm (2020) studied the convergence of GDA in the nonconvex-concave setting and Lin et al. (2020); Xu et al. (2020b) studied the nonconvex-strongly-concave setting. In these general nonconvex settings, it has been shown that GDA converges to a certain stationary point at a sublinear rate, i.e., G(x_t) ≤ t^(-α) for some α > 0, where G(x_t) corresponds to a certain notion of gradient.
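As a minimal numerical illustration of the GDA scheme described above (not part of this paper's analysis), the iteration can be sketched as follows; the toy objective f(x, y) = xy − y²/2 and the step sizes are illustrative choices rather than settings taken from this paper.

```python
def gda(grad_x, grad_y, x0, y0, eta_x=0.05, eta_y=0.1, iters=2000):
    """Plain gradient descent-ascent: descend on x, ascend on y."""
    x, y = float(x0), float(y0)
    for _ in range(iters):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x -= eta_x * gx  # descent step for the min-player x
        y += eta_y * gy  # ascent step for the max-player y
    return x, y

# Toy objective f(x, y) = x*y - y**2/2, which is strongly concave in y.
# For fixed x the inner maximizer is y*(x) = x, so max_y f = x**2/2,
# whose minimizer is x* = 0; GDA should drive (x_t, y_t) toward (0, 0).
x_star, y_star = gda(lambda x, y: y, lambda x, y: x - y, x0=1.0, y0=-1.0)
```

With these step sizes the linearized iteration map has spectral radius below one, so the iterates spiral into the saddle point (0, 0); larger step sizes can make the same recursion diverge, which is one reason variable convergence of GDA is delicate.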
Although such a gradient convergence result implies the stability of the algorithm, namely, lim_{t→∞} ||x_{t+1} − x_t|| = 0, it does not guarantee the convergence of the variable sequences {x_t}_t, {y_t}_t generated by GDA. So far, the variable convergence of GDA has not been established for nonconvex problems, but only under the (strongly) convex function geometries mentioned previously Du and Hu (2019); Mokhtari et al. (2020); Zhang and Wang (2020). Therefore, we ask the following fundamental question:

• Q1: Does GDA have guaranteed variable convergence in nonconvex minimax optimization? If so, where do the variable sequences converge to?

In fact, proving the variable convergence of GDA in the nonconvex setting is highly nontrivial: the algorithm alternates between a minimization step and a maximization step, and consequently the objective function value need not decrease monotonically along the iterates, unlike in gradient-based nonconvex minimization. It is well understood that strong global function geometry leads to the convergence of GDA. However, in the general nonconvex setting, the objective function typically does not have an amenable global geometry. Instead, it may satisfy different types of local geometries around the critical points. Hence, it is natural and much desired to exploit the local geometries of functions in analyzing the convergence of GDA. The Kurdyka-Łojasiewicz (KŁ) geometry provides a broad characterization of such local geometries for nonconvex functions. We are therefore highly motivated to study the variable convergence rate of GDA in nonconvex minimax optimization under the KŁ geometry. In particular, we want to address the following question:

• Q2: How does the local function geometry captured by the KŁ parameter affect the variable convergence rate of GDA?

In this paper, we provide comprehensive answers to these questions. We develop a new analysis framework to study the variable convergence of GDA in nonconvex-strongly-concave minimax optimization under the KŁ geometry, and we characterize the convergence rates of GDA over the full spectrum of the parameterization of the KŁ geometry.
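For reference, a commonly used Łojasiewicz-type parameterization of the KŁ geometry from the nonconvex optimization literature can be written as below; this is a standard formulation stated for orientation, not necessarily the exact definition adopted later in this paper.

```latex
% KŁ (Łojasiewicz) inequality for a proper function F near a critical point \bar{x}:
% there exist c > 0, \theta \in [0, 1), and a neighborhood U of \bar{x} such that,
% for every x \in U with F(x) > F(\bar{x}),
\varphi'\big(F(x) - F(\bar{x})\big)\cdot
  \operatorname{dist}\big(0, \partial F(x)\big) \ge 1,
\qquad \text{with desingularizer } \varphi(s) = c\, s^{1-\theta}.
```

In broad strokes, a smaller KŁ parameter θ corresponds to a sharper local geometry: θ = 1/2 typically yields linear (geometric) convergence, while θ ∈ (1/2, 1) yields sublinear rates, which matches the spectrum of rates from sublinear up to finite-step convergence discussed in this paper.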

1.1. OUR CONTRIBUTIONS

We consider the following regularized nonconvex-strongly-concave minimax optimization problem

min_{x∈R^m} max_{y∈Y} f(x, y) + g(x) − h(y),    (P)

where f is a differentiable and nonconvex-strongly-concave function, g is a general nonconvex regularizer and h is a convex regularizer. Both g and h can possibly be nonsmooth. To solve the above regularized minimax problem, we study a proximal-GDA algorithm that leverages the forward-backward splitting update Lions and Mercier (1979); Attouch et al. (2013). We study the variable convergence property of proximal-GDA in solving the minimax problem (P). Specifically, we show that proximal-GDA admits a novel Lyapunov function H(x, y) (see Proposition 2), which is monotonically decreasing along the trajectory of proximal-GDA, i.e., H(x_{t+1}, y_{t+1}) < H(x_t, y_t). Based on the monotonicity of this Lyapunov function, we show that every limit point of the variable sequences generated by proximal-GDA is a critical point of the objective function. Moreover, by exploiting the ubiquitous KŁ geometry of the Lyapunov function, we prove that the entire variable sequence of proximal-GDA has a unique limit point, or equivalently, it converges to a certain critical point x*, i.e., x_t → x*, y_t → y*(x*).
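To make the forward-backward template concrete, here is a minimal sketch of one way a proximal-GDA iteration for a problem of the form (P) could look; the toy objective f(x, y) = xy − y²/2, the ℓ1 regularizer g(x) = μ|x| (whose proximal map is soft-thresholding), the choice h ≡ 0, and all step sizes are illustrative assumptions, not this paper's setup.

```python
def soft_threshold(v, tau):
    """Proximal map of tau * |.| (the l1 regularizer): shrink v toward 0 by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def proximal_gda(grad_x, grad_y, prox_g, prox_h, x0, y0,
                 eta_x=0.05, eta_y=0.1, iters=3000):
    """Forward-backward (proximal) GDA sketch:
    x_{t+1} = prox_{eta_x * g}(x_t - eta_x * grad_x f(x_t, y_t))
    y_{t+1} = prox_{eta_y * h}(y_t + eta_y * grad_y f(x_t, y_t))
    """
    x, y = float(x0), float(y0)
    for _ in range(iters):
        x_new = prox_g(x - eta_x * grad_x(x, y), eta_x)   # forward-backward on x
        y_new = prox_h(y + eta_y * grad_y(x, y), eta_y)   # forward-backward on y
        x, y = x_new, y_new
    return x, y

# Toy instance: f(x, y) = x*y - y**2/2 (strongly concave in y),
# g(x) = 0.01 * |x|, h = 0 (so prox_h is the identity).
x_star, y_star = proximal_gda(
    grad_x=lambda x, y: y,
    grad_y=lambda x, y: x - y,
    prox_g=lambda v, eta: soft_threshold(v, 0.01 * eta),
    prox_h=lambda v, eta: v,
    x0=1.0, y0=-1.0,
)
```

On this instance the regularized outer objective x²/2 + 0.01|x| is minimized at x* = 0 with y*(0) = 0, and the iterates contract toward that critical point; the shrinkage step only reinforces the contraction of x toward zero.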




