PROXIMAL GRADIENT DESCENT-ASCENT: VARIABLE CONVERGENCE UNDER KŁ GEOMETRY

Abstract

The gradient descent-ascent (GDA) algorithm has been widely applied to solve minimax optimization problems. In order to obtain convergent policy parameters for minimax optimization, it is important that GDA generates convergent variable sequences rather than merely convergent sequences of function values or gradient norms. However, the variable convergence of GDA has been proved only under convexity geometries, and such convergence is not understood for general nonconvex minimax optimization. This paper fills this gap by studying the convergence of a more general proximal-GDA for regularized nonconvex-strongly-concave minimax optimization. Specifically, we show that proximal-GDA admits a novel Lyapunov function, which decreases monotonically during the minimax optimization process and drives the variable sequence to a critical point. By leveraging this Lyapunov function and the Kurdyka-Łojasiewicz (KŁ) geometry that parameterizes the local geometries of general nonconvex functions, we formally establish the variable convergence of proximal-GDA to a critical point x*, i.e., x_t → x*, y_t → y*(x*). Furthermore, over the full spectrum of the KŁ-parameterized geometry, we show that proximal-GDA achieves different types of convergence rates, ranging from sublinear convergence up to finite-step convergence, depending on the geometry associated with the KŁ parameter. This is the first theoretical result on variable convergence for nonconvex minimax optimization.
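For readers unfamiliar with the KŁ geometry invoked above, one standard formulation of the KŁ property is sketched below. The notation here is generic background and an assumption on our part, not necessarily the exact convention used in this paper.

```latex
% KL property of a function F at a critical point \bar{x} (one standard form):
% there exist c > 0, \epsilon > 0, and an exponent \theta \in [0, 1) such that
\|\partial F(x)\|_{-} \;\ge\; c \,\big| F(x) - F(\bar{x}) \big|^{\theta}
% for all x near \bar{x} with F(\bar{x}) < F(x) < F(\bar{x}) + \epsilon,
% where \|\partial F(x)\|_{-} is the minimal norm over subgradients of F at x.
```

Informally, the exponent bounds how flat the function can be near a critical point; the abstract's range of rates, from sublinear up to finite-step convergence, corresponds to different values of this KŁ parameter.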

1. INTRODUCTION

Minimax optimization is a classical optimization framework that has been widely applied in various modern machine learning applications, including game theory Ferreira et al. (2012), generative adversarial networks (GANs) Goodfellow et al. (2014), adversarial training Sinha et al. (2017), reinforcement learning Qiu et al. (2020), imitation learning Ho and Ermon (2016); Song et al. (2018), etc. A typical minimax optimization problem is shown below, where f is a differentiable function.

    min_x max_y f(x, y)

A popular algorithm for solving the above minimax problem is gradient descent-ascent (GDA), which performs a gradient descent update on the variable x and a gradient ascent update on the variable y alternately in each iteration. Under the alternation between descent and ascent updates, it is much desired that GDA generates variable sequences that converge to a certain optimal point, i.e., that the minimax players obtain convergent optimal policies. In the existing literature, many studies have established the convergence of GDA-type algorithms under various global geometries of the objective function, e.g., convex-concave geometry (f is convex in x and concave in y) Nedić and Ozdaglar (2009), bi-linear geometry Neumann (1928); Robinson (1951), and Polyak-Łojasiewicz (PŁ) geometry Nouiehed et al. (2019); Yang et al. (2020). Some other works studied GDA under stronger global geometric conditions on f, such as convex-strongly-concave geometry Du and Hu (2019) and strongly-convex-strongly-concave geometry Mokhtari et al. (2020); Zhang and Wang (2020), under which GDA is shown to generate convergent variable sequences. However, these special global function geometries do not hold for modern machine learning problems, which usually have complex models and nonconvex geometry.

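As a minimal sketch of the alternating descent-ascent updates described above (not this paper's exact algorithm or step-size schedule; the l1 regularizer and the toy objective below are illustrative assumptions on our part), one proximal-GDA loop might look like:

```python
import numpy as np

def prox_l1(v, t):
    # Proximal operator of t * ||.||_1 (soft-thresholding); the l1 regularizer
    # is an illustrative choice, not necessarily the paper's setting.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gda(grad_x, grad_y, x, y, lam=0.0, lr=0.1, steps=1000):
    # Alternating updates: (proximal) gradient descent on x, gradient ascent on y.
    for _ in range(steps):
        x = prox_l1(x - lr * grad_x(x, y), lr * lam)  # descent + proximal step on x
        y = y + lr * grad_y(x, y)                     # ascent step using updated x
    return x, y

# Toy objective f(x, y) = x^2 + 2xy - y^2: convex in x, strongly concave in y,
# with unique saddle point at (0, 0).
gx = lambda x, y: 2 * x + 2 * y
gy = lambda x, y: 2 * x - 2 * y
x_star, y_star = proximal_gda(gx, gy, x=1.0, y=1.0)  # lam=0 recovers plain GDA
```

With lam = 0 the proximal step reduces to the identity, so the loop is plain GDA; on this convex-strongly-concave toy objective the iterates converge to the saddle point (0, 0). The nonconvex-strongly-concave setting studied in the paper requires the Lyapunov-function and KŁ analysis developed later.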
